Abstract

Suppose we have samples of a subset of a collection of random variables. No additional 
information is provided about the number of latent variables, nor of the relationship between 
the latent and observed variables. Is it possible to discover the number of hidden components, 
and to learn a statistical model over the entire collection of variables? We address this question 
in the setting in which the latent and observed variables are jointly Gaussian, with the condi- 
tional statistics of the observed variables conditioned on the latent variables being specified by 
a graphical model. As a first step we give natural conditions under which such latent-variable 
Gaussian graphical models are identifiable given marginal statistics of only the observed vari- 
ables. Essentially these conditions require that the conditional graphical model among the 
observed variables is sparse, while the effect of the latent variables is "spread out" over most 
of the observed variables. Next we propose a tractable convex program based on regularized 
maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both
the $\ell_1$ norm and the nuclear norm. Our modeling framework can be viewed as a combination
of dimensionality reduction (to identify latent variables) and graphical modeling (to capture 
remaining statistical structure not attributable to the latent variables), and it consistently es- 
timates both the number of hidden components and the conditional graphical model structure 
among the observed variables. These results are applicable in the high-dimensional setting in 
which the number of latent/observed variables grows with the number of samples of the observed 
variables. The geometric properties of the algebraic varieties of sparse matrices and of low-rank 
matrices play an important role in our analysis. 

Keywords: Gaussian graphical models; covariance selection; latent variables; regularization; 
sparsity; low-rank; algebraic statistics; high-dimensional asymptotics 



1 Introduction 

Statistical model selection in the high-dimensional regime arises in a number of applications. In 
many data analysis problems in geophysics, radiology, genetics, climate studies, and image pro- 
cessing, the number of samples available is comparable to or even smaller than the number of 
variables. However, it is well-known that empirical statistics such as sample covariance matrices
are not well-behaved when both the number of samples and the number of variables are large and 
comparable to each other (see [26]). Model selection in such a setting is therefore both challenging 
and of great interest. In order for model selection to be well-posed given limited information, a key 
assumption that is often made is that the underlying model to be estimated only has a few degrees 
of freedom. Common assumptions are that the data are generated according to a graphical model, 
or a stationary time-series model, or a simple factor model with a few latent variables. Sometimes 
geometric assumptions are also made in which the data are viewed as samples drawn according to 
a distribution supported on a low-dimensional manifold. 

A model selection problem that has received considerable attention recently is the estimation 
of covariance matrices in the high-dimensional setting. As the sample covariance matrix is poorly 
behaved in such a regime [20, 26], some form of regularization of the sample covariance is adopted 
based on assumptions about the true underlying covariance matrix. For example approaches based 
on banding the sample covariance matrix [3] have been proposed for problems in which the variables
have a natural ordering (e.g., time series), while "permutation-invariant" methods that use
thresholding are useful when there is no natural variable ordering [4, 15]. These approaches provide 
consistency guarantees under various sparsity assumptions on the true covariance matrix. Other 
techniques that have been studied include methods based on shrinkage [24, 39] and factor analysis 
[16]. A number of papers have studied covariance estimation in the context of Gaussian graphical 
model selection. In a Gaussian graphical model the inverse of the covariance matrix, also called 
the concentration matrix, is assumed to be sparse, and the sparsity pattern reveals the conditional 
independence relations satisfied by the variables. The model selection method usually studied in 
such a setting is $\ell_1$-regularized maximum-likelihood, with the $\ell_1$ penalty applied to the entries of
the inverse covariance matrix to induce sparsity. The consistency properties of such an estimator 
have been studied [22, 29, 32], and under suitable conditions [22, 29] this estimator is also "sparsis- 
tent", i.e., the estimated concentration matrix has the same sparsity pattern as the true model from 
which the samples are generated. An alternative approach to $\ell_1$-regularized maximum-likelihood
is to estimate the sparsity pattern of the concentration matrix by performing regression separately 
on each variable [27]; while such a method consistently estimates the sparsity pattern, it does not 
directly provide estimates of the covariance or concentration matrix. 

In many applications throughout science and engineering, a challenge is that one may not have 
access to observations of all the relevant phenomena, i.e., some of the relevant variables may be 
hidden or unobserved. Such a scenario arises in data analysis tasks in psychology, computational 
biology, and economics. In general latent variables pose a significant difficulty for model selection 
because one may not know the number of relevant latent variables, nor the relationship between 
these variables and the observed variables. Typical algorithmic methods that try to get around this 
difficulty usually fix the number of latent variables as well as the structural relationship between 
latent and observed variables (e.g., the graphical model structure between latent and observed 
variables), and use the EM algorithm to fit parameters [11]. This approach suffers from the problem 
that one optimizes non-convex functions, and thus one may get stuck in sub-optimal local minima. 
An alternative method that has been suggested is based on a greedy, local, combinatorial heuristic 
that assigns latent variables to groups of observed variables, based on some form of clustering of 
the observed variables [14]; however, this approach has no consistency guarantees. 

In this paper we study the problem of latent-variable graphical model selection in the setting 
where all the variables, both observed and hidden, are jointly Gaussian. More concretely let the 
covariance matrix of a finite collection of jointly Gaussian random variables $X_O \cup X_H$ be denoted
by $\Sigma_{(O\,H)}$, where $X_O$ are the observed variables and $X_H$ are the unobserved, hidden variables. The
marginal statistics corresponding to the observed variables $X_O$ are given by the marginal covariance
matrix $\Sigma_O$, which is simply a submatrix of the full covariance matrix $\Sigma_{(O\,H)}$. However, suppose that
we parameterize our model by the concentration matrix $K_{(O\,H)} = \Sigma_{(O\,H)}^{-1}$, which as discussed above
reveals the connection to graphical models. In such a parametrization, the marginal concentration
matrix $\Sigma_O^{-1}$ corresponding to the observed variables $X_O$ is given by the Schur complement [19] with
respect to the block $K_H$:

$$\tilde{K}_O = \Sigma_O^{-1} = K_O - K_{O,H} K_H^{-1} K_{H,O}.$$

Thus if we only observe the variables $X_O$, we only have access to $\Sigma_O$ (or $\tilde{K}_O$). The two terms that
compose $\tilde{K}_O$ above have interesting properties. The matrix $K_O$ specifies the concentration matrix
of the conditional statistics of the observed variables given the latent variables. If these conditional
statistics are given by a sparse graphical model then $K_O$ is sparse. On the other hand the matrix
$K_{O,H} K_H^{-1} K_{H,O}$ serves as a summary of the effect of marginalization over the hidden variables $H$.
This matrix has small rank if the number of latent, unobserved variables $H$ is small relative to the
number of observed variables $O$ (the rank is at most $|H|$, and equals $|H|$ when $K_{O,H}$ has full column
rank). Therefore the marginal concentration
matrix $\tilde{K}_O$ of the observed variables $X_O$ is generally not sparse due to the additional low-rank
term $K_{O,H} K_H^{-1} K_{H,O}$. Hence standard graphical model selection techniques applied directly to the
observed variables $X_O$ are not useful.

A modeling paradigm that infers the effect of the latent variables $X_H$ would be more suitable
in order to provide a simple explanation of the underlying statistical structure. Hence we decompose
the marginal concentration matrix $\tilde{K}_O$ into the sparse and low-rank components, which reveals the conditional graphical model
structure in the observed variables as well as the number of and effect due to the unobserved latent 
variables. Such a method can be viewed as a blend of principal component analysis and graphical 
modeling. In standard graphical modeling one would directly approximate a concentration matrix 
by a sparse matrix in order to learn a sparse graphical model. On the other hand in principal com- 
ponent analysis the goal is to explain the statistical structure underlying a set of observations using 
a small number of latent variables (i.e., approximate a covariance matrix as a low-rank matrix). In 
our framework based on decomposing a concentration matrix, we learn a graphical model among 
the observed variables conditioned on a few (additional) latent variables. Notice that in our setting 
these latent variables are not principal components, as the conditional statistics (conditioned on 
these latent variables) are given by a graphical model. Therefore we refer to these latent variables 
informally as hidden components. 

Our first contribution in Section 3 is to address the fundamental question of identifiability of 
such latent-variable graphical models given the marginal statistics of only the observed variables. 
The critical point is that we need to tease apart the correlations induced due to marginalization over 
the latent variables from the conditional graphical model structure among the observed variables. 
As the identifiability problem is one of uniquely decomposing the sum of a sparse matrix and a 
low-rank matrix into the individual components, we study the algebraic varieties of sparse matrices 
and low-rank matrices. An important theme in this paper is the connection between the tangent 
spaces to these algebraic varieties and the question of identifiability. Specifically let $\Omega(K_O)$ denote
the tangent space at $K_O$ to the algebraic variety of sparse matrices, and let $T(K_{O,H} K_H^{-1} K_{H,O})$
denote the tangent space at $K_{O,H} K_H^{-1} K_{H,O}$ to the algebraic variety of low-rank matrices. Then
the statistical question of identifiability of $K_O$ and $K_{O,H} K_H^{-1} K_{H,O}$ given the marginal concentration
matrix $\tilde{K}_O$ is determined by the
geometric notion of transversality of the tangent spaces $\Omega(K_O)$ and $T(K_{O,H} K_H^{-1} K_{H,O})$. The study
of the transversality of these tangent spaces leads us to natural conditions for identifiability. In
particular we show that latent-variable models in which (1) the sparse matrix $K_O$ has a small
number of nonzeros per row/column, and (2) the low-rank matrix $K_{O,H} K_H^{-1} K_{H,O}$ has row/column
spaces that are not closely aligned with the coordinate axes, are identifiable. These two conditions
have natural statistical interpretations. The first condition ensures that there are no densely-connected
subgraphs in the conditional graphical model structure among the observed variables
$X_O$ given the hidden components, i.e., that these conditional statistics are indeed specified by a
sparse graphical model. Such statistical relationships may otherwise be mistakenly attributed to 
the effect of marginalization over some latent variable. The second condition ensures that the effect 
of marginalization over the latent variables is "spread out" over many observed variables; thus, the 
effect of marginalization over a latent variable is not confused with the conditional graphical model 
structure among the observed variables. In fact the first condition is often assumed in some papers 
on standard graphical model selection without latent variables (see for example [29]). We note here 
that the question of parameter identifiability was recently studied for models with discrete-valued latent
variables (i.e., mixture models, hidden Markov models) [1]. However, this work is not applicable 
to our setting in which both the latent and observed variables are assumed to be jointly Gaussian. 

As our next contribution we propose a regularized maximum-likelihood decomposition framework 
to approximate a given sample covariance matrix by a model in which the concentration matrix 
decomposes into a sparse matrix and a low-rank matrix. A number of papers over the last several 
years have suggested that heuristics based on using the $\ell_1$ norm are very effective for recovering
sparse models [6, 12, 13]. Indeed such heuristics have been effectively used, as described above, 
for model selection when the goal is to estimate sparse concentration matrices. In her thesis [17] 
Fazel suggested a convex heuristic based on the nuclear norm for rank-minimization problems in 
order to recover low-rank matrices. This method generalized the previously studied trace heuris- 
tic for recovering low-rank positive semidefinite matrices. Recently several conditions have been 
given under which these heuristics provably recover low-rank matrices in various settings [7, 30]. 
Motivated by the success of these heuristics, we propose the following penalized likelihood method 
given a sample covariance matrix $\Sigma_O^n$ formed from $n$ samples of the observed variables:

$$(\hat{S}_n, \hat{L}_n) = \arg\min_{S, L} \; -\ell(S - L; \Sigma_O^n) + \lambda_n \left( \gamma \|S\|_1 + \mathrm{tr}(L) \right) \quad \text{s.t.} \quad S - L \succ 0, \; L \succeq 0. \qquad (1.1)$$

Here $\ell$ represents the Gaussian log-likelihood function and is given by $\ell(K; \Sigma) = \log\det(K) - \mathrm{tr}(K \Sigma)$
for $K \succ 0$, where tr is the trace of a matrix and det is the determinant. The matrix $\hat{S}_n$ provides an
estimate of $K_O$, which represents the conditional concentration matrix of the observed variables; the
matrix $\hat{L}_n$ provides an estimate of $K_{O,H} K_H^{-1} K_{H,O}$, which represents the effect of marginalization
over the latent variables. Notice that the regularization function is a combination of the $\ell_1$ norm
applied to $S$ and the nuclear norm applied to $L$ (the nuclear norm reduces to the trace over the cone
of symmetric, positive-semidefinite matrices), with $\gamma$ providing a tradeoff between the two terms.
This variational formulation is a convex optimization problem. In particular it is a regularized 
max-det problem and can be solved in polynomial time using standard off-the-shelf solvers [36]. 
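
As a concrete illustration of the objective in (1.1), the following minimal sketch evaluates it for a candidate pair $(S, L)$ using NumPy. It only evaluates the objective and checks feasibility; it does not solve the convex program. The function name `penalized_objective` and the argument names `lam` and `gamma` are illustrative choices and do not come from the paper.

```python
import numpy as np

def penalized_objective(S, L, Sigma_n, lam, gamma):
    """Value of -l(S - L; Sigma_n) + lam * (gamma * ||S||_1 + tr(L)).

    Returns +inf when S - L is not positive definite or L is not
    positive semidefinite (up to a small numerical tolerance).
    """
    K = S - L
    # Feasibility checks via the eigenvalues of the symmetric matrices.
    if np.linalg.eigvalsh(K).min() <= 0 or np.linalg.eigvalsh(L).min() < -1e-10:
        return np.inf
    _, logdet = np.linalg.slogdet(K)
    log_lik = logdet - np.trace(K @ Sigma_n)          # Gaussian log-likelihood
    penalty = gamma * np.abs(S).sum() + np.trace(L)   # l1 norm plus trace
    return -log_lik + lam * penalty

# Toy usage with hypothetical 2 x 2 inputs.
value = penalized_objective(2 * np.eye(2), np.eye(2), np.eye(2), lam=0.5, gamma=1.0)
```

In this toy call $S - L$ is the identity, so the log-likelihood term is $-2$ and the penalty is $0.5 \cdot (4 + 2) = 3$, giving an objective value of $5$.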

Our main result in Section 4 is a proof of the consistency of the estimator (1.1) in the high- 
dimensional regime in which both the number of observed variables and the number of hidden 
components are allowed to grow with the number of samples (of the observed variables). We 
show that for a suitable choice of the regularization parameter $\lambda_n$, there exists a range of values
of $\gamma$ for which the estimates $(\hat{S}_n, \hat{L}_n)$ have the same sparsity (and sign) pattern and rank as
$(K_O, K_{O,H} K_H^{-1} K_{H,O})$ with high probability (see Theorem 4.1). The key technical requirement
is an identifiability condition for the two components of the marginal concentration matrix $\tilde{K}_O$ with
respect to the Fisher information (see Section 3.4). We make connections between our condition
and the irrepresentability conditions required for support/graphical-model recovery using $\ell_1$ regularization
[29, 40]. Our results provide numerous scaling regimes under which consistency holds in
latent-variable graphical model selection. For example we show that under suitable identifiability 
conditions consistent model selection is possible even when the number of samples and the number
of latent variables are on the same order as the number of observed variables (see Section 4.3). 



Related previous work The problem of decomposing the sum of a sparse matrix and a low- 
rank matrix, with no additional noise, into the individual components was initially studied in 
[9] by a superset of the authors of the present paper. Specifically this work proposed a convex 
program using a combination of the $\ell_1$ norm and the nuclear norm to recover the sparse and low-rank
components, and derived conditions under which the convex program exactly recovers these
components. In subsequent work Candes et al. [8] also studied this noise-free sparse-plus-low- 
rank decomposition problem, and provided guarantees for exact recovery using the convex program 
proposed in [9]. The problem setup considered in the present paper is quite different and is more 
challenging because we are only given access to an inexact sample covariance matrix, and we are 
interested in recovering components that preserve both the sparsity pattern and the rank of the 
components in the true underlying model. In addition to proving such a consistency result for 
the estimator (1.1), we also provide a statistical interpretation of our identifiability conditions and 
describe natural classes of latent-variable Gaussian graphical models that satisfy these conditions. 
As such our paper is closer in spirit to the many recent papers on covariance selection, but with 
the important difference that some of the variables are not observed. 

Outline Section 2 gives some background on graphical models as well as the algebraic varieties 
of sparse and low-rank matrices. It also provides a formal statement of the problem. Section 3 
discusses conditions under which latent-variable models are identifiable, and Section 4 states the 
main results of this paper. We provide experimental demonstration of the effectiveness of our 
estimator on synthetic and real data in Section 5. Section 6 concludes the paper with a brief 
discussion. The appendices include additional details and proofs of all of our technical results. 

2 Background and Problem Statement 

We briefly discuss concepts from graphical modeling and give a formal statement of the latent- 
variable model selection problem. We also describe various properties of the algebraic varieties of 
sparse matrices and of low-rank matrices. The following matrix norms are employed throughout 
this paper: 

• $\|M\|_2$: denotes the spectral norm, which is the largest singular value of $M$.

• $\|M\|_\infty$: denotes the largest entry in magnitude of $M$.

• $\|M\|_F$: denotes the Frobenius norm, which is the square-root of the sum of the squares of the
entries of $M$.

• $\|M\|_*$: denotes the nuclear norm, which is the sum of the singular values of $M$. This reduces
to the trace for positive-semidefinite matrices.

• $\|M\|_1$: denotes the sum of the absolute values of the entries of $M$.

A number of matrix operator norms are also used. For example, let $\mathcal{Z} : \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$ be a linear
operator acting on matrices. Then the induced operator norm $\|\mathcal{Z}\|_{q \to q}$ is defined as:

$$\|\mathcal{Z}\|_{q \to q} \triangleq \max_{N \in \mathbb{R}^{p \times p},\ \|N\|_q \le 1} \|\mathcal{Z}(N)\|_q. \qquad (2.1)$$

Therefore, $\|\mathcal{Z}\|_{F \to F}$ denotes the spectral norm of the matrix operator $\mathcal{Z}$. The only vector norm
used is the Euclidean norm, which is denoted by $\| \cdot \|$.

2.1 Gaussian graphical models with latent variables 

A graphical model [23] is a statistical model defined with respect to a graph $(V, \mathcal{E})$ in which the
nodes index a collection of random variables $\{X_v\}_{v \in V}$, and the edges represent the conditional
independence relations (Markov structure) among the variables. The absence of an edge between
nodes $i, j \in V$ implies that the variables $X_i, X_j$ are independent conditioned on all the other
variables. A Gaussian graphical model (also commonly referred to as a Gauss-Markov random
field) is one in which all the variables are jointly Gaussian [33]. In such models the sparsity
pattern of the inverse of the covariance matrix, or the concentration matrix, directly corresponds
to the graphical model structure. Specifically, consider a Gaussian graphical model in which the
covariance matrix is given by $\Sigma \succ 0$ and the concentration matrix is given by $K = \Sigma^{-1}$. Then an
edge $\{i, j\} \in \mathcal{E}$ is present in the underlying graphical model if and only if $K_{i,j} \neq 0$.

Our focus in this paper is on Gaussian models in which some of the variables may not be 
observed. Suppose $O$ represents the set of nodes corresponding to observed variables $X_O$, and
$H$ the set of nodes corresponding to unobserved, hidden variables $X_H$ with $O \cup H = V$ and
$O \cap H = \emptyset$. The joint covariance is denoted by $\Sigma^*_{(O\,H)}$, and the joint concentration matrix by
$K^*_{(O\,H)} = (\Sigma^*_{(O\,H)})^{-1}$. The submatrix $\Sigma^*_O$ represents the marginal covariance of the observed variables $X_O$, and
the corresponding marginal concentration matrix is given by the Schur complement with respect
to the block $K^*_H$:

$$\tilde{K}^*_O = (\Sigma^*_O)^{-1} = K^*_O - K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}. \qquad (2.2)$$

The submatrix $K^*_O$ specifies the concentration matrix of the conditional statistics of the observed
variables conditioned on the hidden components. If these conditional statistics are given by a
sparse graphical model then $K^*_O$ is sparse. On the other hand the marginal concentration matrix
$\tilde{K}^*_O$ of the marginal distribution of $X_O$ is not sparse in general due to the extra correlations induced
from marginalization over the latent variables $X_H$, i.e., due to the presence of the additional term
$K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$. Hence, standard graphical model selection techniques in which the goal is to
approximate a sample covariance by a sparse graphical model are not well-suited for problems in
which some of the variables are hidden. However, the matrix $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$ is a low-rank matrix
if the number of hidden variables is much smaller than the number of observed variables (i.e.,
$|H| \ll |O|$). Therefore, a more appropriate model selection method is to approximate the sample
covariance by a model in which the concentration matrix decomposes into the sum of a sparse
matrix and a low-rank matrix. The objective here is to learn a sparse graphical model among the
observed variables conditioned on some latent variables, as such a model explicitly accounts for the
extra correlations induced due to unobserved, hidden components.
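
The sparse-plus-low-rank structure of the marginal concentration matrix can be checked numerically. The sketch below builds a small joint concentration matrix and confirms that inverting the observed block of the covariance reproduces the Schur complement. All sizes and numerical values are made-up illustrations (6 observed and 2 hidden variables, a tridiagonal observed block), not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_hid = 6, 2

# Hypothetical joint concentration matrix: sparse (tridiagonal) observed block,
# dense observed-hidden coupling, and a simple hidden block.
K_O = 3.0 * np.eye(n_obs) + 0.5 * (np.eye(n_obs, k=1) + np.eye(n_obs, k=-1))
K_OH = 0.3 * rng.standard_normal((n_obs, n_hid))
K_H = 2.0 * np.eye(n_hid)
K = np.block([[K_O, K_OH], [K_OH.T, K_H]])
is_valid_model = np.linalg.eigvalsh(K).min() > 0   # K must be positive definite

# Marginal covariance of the observed variables is a submatrix of the joint covariance.
Sigma = np.linalg.inv(K)
Sigma_O = Sigma[:n_obs, :n_obs]

# Marginal concentration matrix versus its sparse and low-rank pieces.
K_tilde_O = np.linalg.inv(Sigma_O)
low_rank = K_OH @ np.linalg.inv(K_H) @ K_OH.T
schur_matches = np.allclose(K_tilde_O, K_O - low_rank)
```

Here `K_O` has only 16 nonzero entries, while `K_tilde_O` is generically fully dense because of the rank-2 correction `low_rank`.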

2.2 Problem statement 

In order to analyze latent-variable model selection methods, we need to define an appropriate notion
of model selection consistency for latent-variable graphical models. Notice that given the two
components $K^*_O$ and $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$ of the concentration matrix of the marginal distribution (2.2),
there are infinitely many configurations of the latent variables (i.e., matrices $K^*_H \succ 0$, $K^*_{O,H} =
(K^*_{H,O})^T$) that give rise to the same low-rank matrix $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$. Specifically for any non-singular
matrix $B \in \mathbb{R}^{|H| \times |H|}$, one can apply the transformations $K^*_H \to B K^*_H B^T$, $K^*_{O,H} \to K^*_{O,H} B^T$ and
still preserve the low-rank matrix $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$. In all of these models the marginal statistics
of the observed variables $X_O$ remain the same upon marginalization over the latent variables
$X_H$. The key invariant is the low-rank matrix $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$, which summarizes the effect of
marginalization over the latent variables. These observations give rise to the following notion of
consistency:

Definition 2.1. A pair of (symmetric) matrices $(S, L)$ with $S, L \in \mathbb{R}^{|O| \times |O|}$ is an algebraically
consistent estimate of a latent-variable Gaussian graphical model given by the concentration matrix
$K^*_{(O\,H)}$ if the following conditions hold:

1. The sign-pattern of $S$ is the same as that of $K^*_O$:

$$\mathrm{sign}(S_{i,j}) = \mathrm{sign}((K^*_O)_{i,j}), \quad \forall i, j.$$

Here we assume that $\mathrm{sign}(0) = 0$.

2. The rank of $L$ is the same as the rank of $K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}$:

$$\mathrm{rank}(L) = \mathrm{rank}(K^*_{O,H} (K^*_H)^{-1} K^*_{H,O}).$$

3. The concentration matrix $S - L$ can be realized as the marginal concentration matrix of an
appropriate latent-variable model:

$$S - L \succ 0, \quad L \succeq 0.$$

The first condition ensures that S provides the correct structural estimate of the conditional 
graphical model (given by $K^*_O$) of the observed variables conditioned on the hidden components.
This property is the same as the "sparsistency" property studied in standard graphical model
selection [22, 29]. The second condition ensures that the number of hidden components is correctly
estimated. Finally, the third condition ensures that the pair of matrices $(S, L)$ leads to a realizable
latent-variable model. In particular this condition implies that there exists a valid latent-variable
model on $|O \cup H|$ variables in which (a) the conditional graphical model structure among the
observed variables is given by $S$, (b) the number of latent variables $|H|$ is equal to the rank of $L$,
and (c) the extra correlations induced due to marginalization over the latent variables are equal to $L$.
Any method for matrix factorization (see for example, [38]) can be used to factorize the low-rank 
matrix L, depending on the properties that one desires in the factors (e.g., sparsity). 
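
The three conditions of Definition 2.1 translate directly into a numerical check. The sketch below is one possible way to test them, assuming the true components are available (as in a simulation study); the function name `is_algebraically_consistent` and the tolerance `tol` used to decide which entries and singular values count as zero are assumptions of this example.

```python
import numpy as np

def is_algebraically_consistent(S, L, K_O_true, L_true, tol=1e-8):
    """Check the three conditions of Definition 2.1 up to the tolerance `tol`."""
    def sign_pattern(M):
        # Entries with magnitude at most tol are treated as exact zeros.
        return np.sign(np.where(np.abs(M) > tol, M, 0.0))

    # 1. Same sign pattern (and hence same support) as the true sparse part.
    same_signs = np.array_equal(sign_pattern(S), sign_pattern(K_O_true))
    # 2. Same rank as the true low-rank part.
    same_rank = (np.linalg.matrix_rank(L, tol=tol)
                 == np.linalg.matrix_rank(L_true, tol=tol))
    # 3. Realizability: S - L positive definite, L positive semidefinite.
    realizable = (np.linalg.eigvalsh(S - L).min() > 0
                  and np.linalg.eigvalsh(L).min() >= -tol)
    return bool(same_signs and same_rank and realizable)

# Toy usage with hypothetical 3 x 3 components.
K_O_true = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.0], [0.0, 0.0, 2.0]])
u = np.ones((3, 1)) / np.sqrt(3)
L_true = 0.3 * (u @ u.T)
consistent = is_algebraically_consistent(1.1 * K_O_true, 0.2 * (u @ u.T), K_O_true, L_true)
wrong_rank = is_algebraically_consistent(1.1 * K_O_true, np.zeros((3, 3)), K_O_true, L_true)
```

Note that the first estimate is algebraically consistent even though its entries differ from the true values, which is exactly the distinction from parametric consistency.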

We also study parametric consistency in the usual sense, i.e., we show that one can produce 
estimates $(S, L)$ that converge in various norms to the matrices $(K^*_O, K^*_{O,H} (K^*_H)^{-1} K^*_{H,O})$. Notice that
proving $(S, L)$ is close to $(K^*_O, K^*_{O,H} (K^*_H)^{-1} K^*_{H,O})$ in some norm does not in general imply that the
support/sign-pattern and rank of $(S, L)$ are the same as those of $(K^*_O, K^*_{O,H} (K^*_H)^{-1} K^*_{H,O})$. Therefore
parametric consistency is different from algebraic consistency, which requires that $(S, L)$ have the
same support/sign-pattern and rank as $(K^*_O, K^*_{O,H} (K^*_H)^{-1} K^*_{H,O})$.

Goal Let $K^*_{(O\,H)}$ denote the concentration matrix of a Gaussian model. Suppose that we have
$n$ samples $\{X_O^i\}_{i=1}^n$ of the observed variables $X_O$. We would like to produce estimates $(\hat{S}_n, \hat{L}_n)$
that, with high probability, are both algebraically consistent and parametrically consistent (in some
norm).






2.3 Likelihood function and Fisher information 

Given $n$ samples $\{X^i\}_{i=1}^n$ of a finite collection of jointly Gaussian zero-mean random variables with
concentration matrix $K^*$, we define the sample covariance as follows:

$$\Sigma^n \triangleq \frac{1}{n} \sum_{i=1}^n X^i (X^i)^T. \qquad (2.3)$$

It is then easily seen that the log-likelihood function is given (up to an additive constant and a
positive scale factor) by:

$$\ell(K; \Sigma^n) = \log\det(K) - \mathrm{tr}(K \Sigma^n), \qquad (2.4)$$

where $\ell(K; \Sigma^n)$ is a function of $K$. Notice that this function is strictly concave for $K \succ 0$. Now
consider the latent-variable modeling problem in which we wish to model a collection of random 
variables $X_O$ (with sample covariance $\Sigma_O^n$) by adding some extra variables $X_H$. With respect to
the parametrization $(S, L)$ (with $S$ representing the conditional statistics of $X_O$ given $X_H$, and $L$
summarizing the effect of marginalization over the additional variables $X_H$), the likelihood function
is given by:

$$\bar{\ell}(S, L; \Sigma_O^n) = \ell(S - L; \Sigma_O^n).$$

The function $\bar{\ell}$ is jointly concave with respect to the parameters $(S, L)$ whenever $S - L \succ 0$, and it
is this function that we use in our variational formulation (1.1) to learn a latent-variable model. 
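
The sample covariance (2.3) and the log-likelihood (2.4) are straightforward to compute. The sketch below assumes zero-mean samples stored as the rows of an array `X`; the helper names and the synthetic data are illustrative only. It also records the standard fact that, without regularization, (2.4) is maximized at the inverse of the sample covariance whenever that inverse exists.

```python
import numpy as np

def sample_covariance(X):
    """Equation (2.3) for zero-mean samples stored as the rows of X (shape n x p)."""
    return X.T @ X / X.shape[0]

def log_likelihood(K, Sigma_n):
    """Equation (2.4); assumes K is symmetric positive definite."""
    _, logdet = np.linalg.slogdet(K)
    return logdet - np.trace(K @ Sigma_n)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))      # hypothetical data: n = 500 samples, p = 4 variables
Sigma_n = sample_covariance(X)

# Unregularized maximizer of (2.4): the gradient K^{-1} - Sigma_n vanishes here.
K_ml = np.linalg.inv(Sigma_n)
best = log_likelihood(K_ml, Sigma_n)
```

In the latent-variable parametrization the same function is simply evaluated at `S - L`, for example `log_likelihood(S - L, Sigma_n)`.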

In the analysis of a convex program involving the likelihood function, the Fisher information 
plays an important role as it is the negative of the Hessian of the likelihood function and thus 
controls the curvature. As the second term in the likelihood function (2.4) is linear, we need only study
higher-order derivatives of the log-determinant function in order to compute the Hessian. Letting 
$\mathcal{I}$ denote the Fisher information matrix, we have that [5]

$$\mathcal{I}(K^*) \triangleq -\nabla^2_K \log\det(K) \big|_{K = K^*} = (K^*)^{-1} \otimes (K^*)^{-1}$$

for $K^* \succ 0$. If $K^*$ is a $p \times p$ concentration matrix, then the Fisher information matrix $\mathcal{I}(K^*)$ has
dimensions $p^2 \times p^2$. Next consider the latent-variable situation with the variables indexed by $O$ being
observed and the variables indexed by $H$ being hidden. The concentration matrix $\tilde{K}^*_O = (\Sigma^*_O)^{-1}$
of the marginal distribution of the observed variables $O$ is given by the Schur complement (2.2),
and the corresponding Fisher information matrix is given by

$$\mathcal{I}(\tilde{K}^*_O) = (\tilde{K}^*_O)^{-1} \otimes (\tilde{K}^*_O)^{-1} = \Sigma^*_O \otimes \Sigma^*_O.$$

Notice that this is precisely the $|O|^2 \times |O|^2$ submatrix of the full Fisher information matrix
$\mathcal{I}(K^*_{(O\,H)}) = \Sigma^*_{(O\,H)} \otimes \Sigma^*_{(O\,H)}$ with respect to all the parameters $K^*_{(O\,H)} = (\Sigma^*_{(O\,H)})^{-1}$ (corresponding
to the situation in which all the variables $X_{O \cup H}$ are observed). The matrix $\mathcal{I}(K^*_{(O\,H)})$
has dimensions $|O \cup H|^2 \times |O \cup H|^2$, while $\mathcal{I}(\tilde{K}^*_O)$ is an $|O|^2 \times |O|^2$ matrix. To summarize, we have
for all $i, j, k, l \in O$ that:

$$\mathcal{I}(\tilde{K}^*_O)_{(i,j),(k,l)} = [\Sigma^*_{(O\,H)} \otimes \Sigma^*_{(O\,H)}]_{(i,j),(k,l)} = \mathcal{I}(K^*_{(O\,H)})_{(i,j),(k,l)}.$$

In Section 3.4 we impose various conditions on the Fisher information matrix $\mathcal{I}(\tilde{K}^*_O)$ under which
our regularized maximum-likelihood formulation provides consistent estimates with high probability.
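
Both facts about the Fisher information can be verified numerically on a small example: the Kronecker-product form agrees with a finite-difference second derivative of the log-determinant, and the Fisher information of the marginal model is a submatrix of that of the joint model. The matrix sizes, the random model, and the step size `h` below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_obs = 5, 3                       # 5 variables in total, the first 3 observed
A = rng.standard_normal((p, p))
K = A @ A.T + p * np.eye(p)           # hypothetical joint concentration matrix
Sigma = np.linalg.inv(K)

fisher_full = np.kron(Sigma, Sigma)   # p^2 x p^2 Fisher information of the joint model
Sigma_O = Sigma[:n_obs, :n_obs]
fisher_obs = np.kron(Sigma_O, Sigma_O)

# Row-major positions of the index pairs (i, j) with both i and j observed.
idx = np.array([i * p + j for i in range(n_obs) for j in range(n_obs)])
sub = fisher_full[np.ix_(idx, idx)]
is_submatrix = np.allclose(sub, fisher_obs)

# Second-order finite difference of log det along a symmetric direction D.
D = rng.standard_normal((p, p))
D = (D + D.T) / 2

def logdet_along(t):
    return np.linalg.slogdet(K + t * D)[1]

h = 1e-3
second_derivative = (logdet_along(h) - 2 * logdet_along(0.0) + logdet_along(-h)) / h**2
quadratic_form = D.ravel() @ fisher_full @ D.ravel()   # equals -second_derivative up to O(h^2)
```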

2.4 Algebraic varieties of sparse and low-rank matrices

An algebraic variety is the solution set of a system of polynomial equations. The set of sparse 
matrices and the set of low-rank matrices can be naturally viewed as algebraic varieties. Here we 
describe these varieties, and discuss some of their properties. Of particular interest in this paper are 
geometric properties of these varieties such as the tangent space and local curvature at a (smooth) 
point. 

Let $\mathcal{S}(k)$ denote the set of matrices with at most $k$ nonzeros:

$$\mathcal{S}(k) \triangleq \{M \in \mathbb{R}^{p \times p} \mid |\mathrm{support}(M)| \le k\}. \qquad (2.5)$$

The set $\mathcal{S}(k)$ is an algebraic variety, and can in fact be viewed as a union of $\binom{p^2}{k}$ subspaces
in $\mathbb{R}^{p \times p}$. This variety has dimension $k$, and it is smooth everywhere except at those matrices
that have support size strictly smaller than $k$. For any matrix $M \in \mathbb{R}^{p \times p}$, consider the variety
$\mathcal{S}(|\mathrm{support}(M)|)$; $M$ is a smooth point of this variety, and the tangent space at $M$ is given by

$$\Omega(M) = \{N \in \mathbb{R}^{p \times p} \mid \mathrm{support}(N) \subseteq \mathrm{support}(M)\}. \qquad (2.6)$$

In words the tangent space $\Omega(M)$ at a smooth point $M$ is given by the set of all matrices that have
support contained within the support of $M$. We view $\Omega(M)$ as a subspace in $\mathbb{R}^{p \times p}$.
Next let $\mathcal{L}(r)$ denote the algebraic variety of matrices with rank at most $r$:

$$\mathcal{L}(r) \triangleq \{M \in \mathbb{R}^{p \times p} \mid \mathrm{rank}(M) \le r\}. \qquad (2.7)$$

It is easily seen that $\mathcal{L}(r)$ is an algebraic variety because it can be defined through the vanishing
of all $(r + 1) \times (r + 1)$ minors. This variety has dimension equal to $r(2p - r)$, and it is smooth
everywhere except at those matrices that have rank strictly smaller than $r$. Consider a rank-$r$
matrix $M$ with SVD given by $M = U D V^T$, where $U, V \in \mathbb{R}^{p \times r}$ and $D \in \mathbb{R}^{r \times r}$. The matrix $M$ is a
smooth point of the variety $\mathcal{L}(\mathrm{rank}(M))$, and the tangent space at $M$ with respect to this variety
is given by

$$T(M) = \{U Y_1^T + Y_2 V^T \mid Y_1, Y_2 \in \mathbb{R}^{p \times r}\}. \qquad (2.8)$$

In words the tangent space $T(M)$ at a smooth point $M$ is the span of all matrices that have either
the same row-space as $M$ or the same column-space as $M$. As with $\Omega(M)$ we view $T(M)$ as a
subspace in $\mathbb{R}^{p \times p}$.

In Section 3 we explore the connection between geometric properties of these tangent spaces 
and the identifiability problem in latent-variable graphical models. 
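
The two tangent spaces are easy to work with computationally through their orthogonal projections (with respect to the Frobenius inner product). The sketch below uses the standard closed form $P_U N + N P_V - P_U N P_V$ for the projection onto $T(M)$, where $P_U$ and $P_V$ project onto the column and row spaces of $M$; this formula is not stated in the text above and is supplied here as an assumption of the example, as are the function names and the random test matrices.

```python
import numpy as np

def project_sparse_tangent(N, M, tol=1e-12):
    """Projection onto Omega(M): keep the entries of N on the support of M."""
    return np.where(np.abs(M) > tol, N, 0.0)

def project_lowrank_tangent(N, M, r):
    """Projection onto T(M) for a rank-r matrix M: P_U N + N P_V - P_U N P_V."""
    U, _, Vt = np.linalg.svd(M)
    P_U = U[:, :r] @ U[:, :r].T        # projector onto the column space of M
    P_V = Vt[:r].T @ Vt[:r]            # projector onto the row space of M
    return P_U @ N + N @ P_V - P_U @ N @ P_V

rng = np.random.default_rng(3)
p, r = 4, 2
M = rng.standard_normal((p, r)) @ rng.standard_normal((r, p))   # rank-r matrix
N = rng.standard_normal((p, p))

PN = project_lowrank_tangent(N, M, r)
# Matrix of the projection acting on vectorized p x p matrices; its rank is dim T(M).
basis = np.eye(p * p)
P_mat = np.column_stack([
    project_lowrank_tangent(basis[:, k].reshape(p, p), M, r).ravel()
    for k in range(p * p)
])
tangent_dim = np.linalg.matrix_rank(P_mat)
```

For these sizes the computed dimension agrees with $r(2p - r) = 12$, the dimension of the rank variety stated above.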

2.5 Curvature of rank variety 

The sparse matrix variety S(k) has the property that it has zero curvature at any smooth point. 
Consequently the tangent space at a smooth point M is the same as the tangent space at any 
point in a neighborhood of M. This property is implicitly used in the analysis of $\ell_1$ regularized
methods for recovering sparse models. The situation is more complicated for the low-rank matrix 
variety, because the curvature at any smooth point is nonzero. Therefore we need to study how the 
tangent space changes from one point to a neighboring point by analyzing how this variety curves 
locally. Indeed the amount of curvature at a point is directly related to the "angle" between the 
tangent space at that point and the tangent space at a neighboring point. For any subspace T of 
matrices, let $\mathcal{P}_T$ denote the projection onto $T$. Given two subspaces $T_1, T_2$ of the same dimension,
we measure the "twisting" between these subspaces by considering the following quantity:

$$\rho(T_1, T_2) \triangleq \|\mathcal{P}_{T_1} - \mathcal{P}_{T_2}\|_{2 \to 2} = \max_{\|N\|_2 \le 1} \|[\mathcal{P}_{T_1} - \mathcal{P}_{T_2}](N)\|_2. \qquad (2.9)$$
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In Appendix A we briefly review relevant results from matrix perturbation theory; the key 
tool used to derive these results is the resolvent of a matrix [21]. Based on these tools we prove 
the following two results in Appendix B, which bound the twisting between the tangent spaces at 
nearby points. The first result provides a bound on the quantity p between the tangent spaces at 
a point and at its neighbor. 

Proposition 2.1. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest nonzero singular value equal to
$\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\sigma}{8}$. Further, let $M + \Delta$ be a rank-$r$ matrix.
Then we have that

$$\rho(T(M + \Delta), T(M)) \le \frac{2}{\sigma} \|\Delta\|_2.$$

The next result bounds the error between a point and its neighbor in the normal direction. 

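Proposition 2.2. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest nonzero singular value equal to
$\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\sigma}{8}$. Further, let $M + \Delta$ be a rank-$r$ matrix.
Then we have that

$$\|\mathcal{P}_{T(M)^\perp}(\Delta)\|_2 \le \frac{\|\Delta\|_2^2}{\sigma}.$$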

These results suggest that the closer the smallest singular value is to zero, the more curved the 
variety is locally. Therefore we control the twisting between tangent spaces at nearby points by 
bounding the smallest nonzero singular value away from zero. 
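
The second-order behaviour in the normal direction can be checked numerically. The sketch below reads Proposition 2.2 as $\|\mathcal{P}_{T(M)^\perp}(\Delta)\|_2 \le \|\Delta\|_2^2 / \sigma$ for $\|\Delta\|_2 \le \sigma/8$; this reading of the constants, the way the nearby rank-$r$ matrix is generated, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 8, 3

# Rank-r matrix M = U D V^T whose smallest nonzero singular value is sigma = 1.
U, _ = np.linalg.qr(rng.standard_normal((p, r)))
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
D = np.diag([3.0, 2.0, 1.0])
M = U @ D @ V.T
sigma = 1.0

# A nearby rank-r matrix, obtained by slightly perturbing both factors.
eps = 1e-3
U_near = U + eps * rng.standard_normal((p, r))
V_near = V + eps * rng.standard_normal((p, r))
M_near = U_near @ D @ V_near.T
Delta = M_near - M

# Component of Delta normal to T(M): (I - P_U) Delta (I - P_V).
I = np.eye(p)
normal_part = (I - U @ U.T) @ Delta @ (I - V @ V.T)

delta_norm = np.linalg.norm(Delta, 2)          # spectral norm of the perturbation
normal_norm = np.linalg.norm(normal_part, 2)   # spectral norm of the normal component
print(delta_norm, normal_norm, delta_norm ** 2 / sigma)
```

The normal component is quadratic in the size of the perturbation, so shrinking `eps` by a factor of ten shrinks `normal_norm` by roughly a factor of one hundred.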



3 Identifiability 

In the absence of additional conditions, the latent-variable model selection problem is ill-posed. In 
this section we discuss a set of conditions on latent-variable models that ensure that these models 
are identifiable given marginal statistics for a subset of the variables. 



3.1 Structure between latent and observed variables 

Suppose that the low-rank matrix that summarizes the effect of the hidden components is itself 
sparse. This leads to identifiability issues in the sparse-plus-low-rank decomposition problem. 
Statistically the additional correlations induced due to marginalization over the latent variables 
could be mistaken for the conditional graphical model structure of the observed variables. In order 
to avoid such identifiability problems the effect of the latent variables must be "diffuse" across 
the observed variables. To address this point the following quantity was introduced in [9] for any 
matrix M, defined with respect to the tangent space T(M): 

$$\xi(T(M)) \triangleq \max_{N \in T(M),\ \|N\|_2 \le 1} \|N\|_\infty. \tag{3.1}$$

Thus $\xi(T(M))$ being small implies that elements of the tangent space $T(M)$ cannot have their
support concentrated in a few locations; as a result $M$ cannot be too sparse. This idea is formalized
in [9] by relating $\xi(T(M))$ to a notion of "incoherence" of the row/column spaces, where the
row/column spaces are said to be incoherent with respect to the standard basis if these spaces are
not aligned closely with any of the coordinate axes. Letting $M = U D V^T$ be the singular value
decomposition of $M$, the incoherence of the row/column spaces of $M$ (initially proposed and studied
by Candes and Recht [7]) is defined as:

$$\mathrm{inc}(M) \triangleq \max\left\{\max_i \|P_U(e_i)\|_2,\ \max_i \|P_V(e_i)\|_2\right\}. \tag{3.2}$$

Here $P_V, P_U$ denote projections¹ onto the row/column spaces of $M$, and $e_i$ is the $i$'th standard
basis vector. Hence $\mathrm{inc}(M)$ measures the projection of the most "closely aligned" coordinate axis
with the row/column spaces. For any rank-$r$ matrix $M$ we have that

$$\sqrt{\frac{r}{p}} \le \mathrm{inc}(M) \le 1, \tag{3.3}$$

where the lower bound is achieved (for example) if the row/column spaces span any $r$ columns of
a $p \times p$ orthonormal Hadamard matrix, while the upper bound is achieved if the row or column
space contains a standard basis vector. Typically a matrix $M$ with incoherent row/column spaces
would have $\mathrm{inc}(M) \ll 1$. The following result (proved in [9]) shows that the more incoherent the
row/column spaces of $M$, the smaller is $\xi(T(M))$.

Proposition 3.1. For any $M \in \mathbb{R}^{p \times p}$, we have that

$$\mathrm{inc}(M) \le \xi(T(M)) \le 2\,\mathrm{inc}(M),$$

where $\xi(T(M))$ and $\mathrm{inc}(M)$ are defined in (3.1) and (3.2).

Based on these concepts we roughly require that the low-rank matrix that summarizes the
effect of the latent variables be incoherent, thereby ensuring that the extra correlations due to
marginalization over the hidden components cannot be confused with the conditional graphical
model structure of the observed variables. Notice that the quantity inc is not just a measure
of the number of latent variables, but also of the overall effect of the correlations induced by
marginalization over these variables.

Curvature and change in $\xi$: As noted previously an important technical point is that the
algebraic variety of low-rank matrices is locally curved at any smooth point. Consequently the
quantity $\xi$ changes as we move along the low-rank matrix variety smoothly. The quantity $\rho(T_1, T_2)$
introduced in (2.9) also allows us to bound the variation in $\xi$ as follows.

Lemma 3.2. Let $T_1, T_2$ be two matrix subspaces of the same dimension with the property that
$\rho(T_1, T_2) < 1$, where $\rho$ is defined in (2.9). Then we have that



This lemma is proved in Appendix B.

1 We denote projections onto vector subspaces (defined by a matrix) by $P$, and projections onto matrix subspaces
(defined by a general linear operator) by the calligraphic $\mathcal{P}$.

3.2 Structure among observed variables

An identifiability problem also arises if the conditional graphical model among the observed
variables contains a densely connected subgraph. These statistical relationships might be mistaken as
correlations induced by marginalization over latent variables. Therefore we need to ensure that
the conditional graphical model among the observed variables is sparse. We impose the condition
that this conditional graphical model must have small "degree", i.e., no observed variable is directly
connected to too many other observed variables conditioned on the hidden components. Notice that
bounding the degree is a more refined condition than simply bounding the total number of nonzeros
as the sparsity pattern also plays a role. In [9] the authors introduced the following quantity in
order to provide an appropriate measure of the sparsity pattern of a matrix:

$$\mu(\Omega(M)) \triangleq \max_{N \in \Omega(M),\ \|N\|_\infty \le 1} \|N\|_2. \tag{3.4}$$

The quantity $\mu(\Omega(M))$ being small for a matrix implies that the spectrum of any element of the
tangent space $\Omega(M)$ is not too "concentrated", i.e., the singular values of the elements of the
tangent space are not too large. In [9] it is shown that a sparse matrix $M$ with "bounded degree"
(a small number of nonzeros per row/column) has small $\mu(\Omega(M))$.

Proposition 3.3. Let $M \in \mathbb{R}^{p \times p}$ be any matrix with at most $\deg_{\max}(M)$ nonzero entries per
row/column, and with at least $\deg_{\min}(M)$ nonzero entries per row/column. With $\mu(\Omega(M))$ as
defined in (3.4), we have that

$$\deg_{\min}(M) \le \mu(\Omega(M)) \le \deg_{\max}(M).$$

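
The two structural quantities of Sections 3.1 and 3.2 are easy to evaluate on small examples. The sketch below computes the incoherence of (3.2) for a "diffuse" and a "spiky" low-rank matrix, and compares a feasible point of (3.4) against the degree bounds of Proposition 3.3. The helper names (`hadamard`, `incoherence`) and the specific test matrices are assumptions chosen for illustration; the Hadamard matrix uses the Sylvester construction, so $p$ must be a power of two here.

```python
import numpy as np

def hadamard(p):
    """Sylvester construction of a p x p Hadamard matrix (p a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < p:
        H = np.block([[H, H], [H, -H]])
    return H

def incoherence(M, r):
    """inc(M) of (3.2): largest norm of a standard basis vector after projection
    onto the column space or the row space of the rank-r matrix M."""
    U, _, Vt = np.linalg.svd(M)
    col = np.linalg.norm(U[:, :r], axis=1)      # ||P_U e_i||_2 for each i
    row = np.linalg.norm(Vt[:r, :].T, axis=1)   # ||P_V e_i||_2 for each i
    return max(col.max(), row.max())

p, r = 8, 2
Q = hadamard(p)[:, :r] / np.sqrt(p)             # r orthonormal Hadamard columns
L_diffuse = Q @ Q.T                             # row/column spaces spread over all axes
L_spiky = np.zeros((p, p))
L_spiky[0, 0] = 1.0                             # row/column space contains e_1

print(incoherence(L_diffuse, r), np.sqrt(r / p))   # attains the lower bound in (3.3)
print(incoherence(L_spiky, 1))                      # attains the upper bound in (3.3)

# Degree bounds for a tridiagonal sparsity pattern.
S = np.eye(p) + np.eye(p, k=1) + np.eye(p, k=-1)
deg = np.count_nonzero(S, axis=1)
sign_pattern = (S != 0).astype(float)           # feasible in (3.4): entries in [-1, 1]
mu_lower = np.linalg.norm(sign_pattern, 2)      # hence a lower bound on mu(Omega(S))
print(deg.min(), mu_lower, deg.max())
```

The value `mu_lower` is only a lower bound on $\mu(\Omega(S))$ obtained from one feasible matrix; it already lies between the minimum and maximum degree, as Proposition 3.3 requires of $\mu$ itself.
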
3.3 Transversality of tangent spaces 

Suppose that we have the sum of two vectors, each from one of two known subspaces. It is possible to
uniquely recover the individual vectors from the sum if and only if the subspaces have a transverse 
intersection, i.e., they only intersect at the origin. This simple observation leads to an appealing 
algebraic notion of identifiability. Consider the situation in which we have the sum of a sparse 
matrix and a low-rank matrix. In addition to this sum, suppose that we are also given the tangent 
spaces at these matrices with respect to the algebraic varieties of sparse and low-rank matrices 
respectively. Then a necessary and sufficient condition for local identifiability is that these tangent 
spaces have a transverse intersection. It turns out that these transversality conditions on the 
tangent spaces are also sufficient for the regularized maximum-likelihood convex program (1.1) to 
provide consistent estimates of the number of hidden components and the conditional graphical 
model structure of the observed variables conditioned on the latent variables (without any side 
information about the tangent spaces). 

In order to quantify the level of transversality between the tangent spaces $\Omega$ and $T$ we study
the minimum gain with respect to some norm of the addition operator restricted to the cartesian
product $\mathcal{Y} = \Omega \times T$. More concretely let $\mathcal{A} : \mathbb{R}^{p \times p} \times \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$ represent the addition operator,
i.e., the operator that adds two matrices. Then given any matrix norm $\|\cdot\|_q$ on $\mathbb{R}^{p \times p} \times \mathbb{R}^{p \times p}$, the
minimum gain of $\mathcal{A}$ restricted to $\mathcal{Y}$ is defined as follows:

$$\epsilon(\Omega, T, \|\cdot\|_q) \triangleq \min_{(S,L) \in \Omega \times T,\ \|(S,L)\|_q = 1} \|\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S,L)\|_q,$$

where $\mathcal{P}_{\mathcal{Y}}$ denotes the projection onto the space $\mathcal{Y}$, and $\mathcal{A}^\dagger$ denotes the adjoint of the addition
operator (with respect to the standard Euclidean inner-product). The tangent spaces $\Omega$ and $T$
have a transverse intersection if and only if $\epsilon(\Omega, T, \|\cdot\|_q) > 0$. The "level" of transversality is
measured by the magnitude of $\epsilon(\Omega, T, \|\cdot\|_q)$. Note that if the norm $\|\cdot\|_q$ used is the Frobenius
norm, then $\epsilon(\Omega, T, \|\cdot\|_F)$ is the square of the minimum singular value of the addition operator $\mathcal{A}$
restricted to $\Omega \times T$.

A natural norm with which to measure transversality is the dual norm of the regularization 
function in (1.1), as the subdifferential of the regularization function is specified in terms of its 
dual. The reasons for this will become clearer as we proceed through this paper. Recall that the 
regularization function used in the variational formulation (1.1) is given by: 

$$f_\gamma(S, L) = \gamma \|S\|_1 + \|L\|_*,$$

where the nuclear norm $\|\cdot\|_*$ reduces to the trace function over the cone of positive-semidefinite
matrices. This function is a norm for all $\gamma > 0$. The dual norm of $f_\gamma$ is given by

$$g_\gamma(S, L) = \max\left\{\frac{\|S\|_\infty}{\gamma},\ \|L\|_2\right\}.$$

The following simple lemma records a useful property of the $g_\gamma$ norm that is used several times
throughout this paper.

Lemma 3.4. Let $\Omega$ and $T$ be tangent spaces at any points with respect to the algebraic varieties of
sparse and low-rank matrices. Then for any matrix $M$, we have that $\|\mathcal{P}_\Omega(M)\|_\infty \le \|M\|_\infty$ and that
$\|\mathcal{P}_T(M)\|_2 \le 2\|M\|_2$. Further we also have that $\|\mathcal{P}_{\Omega^\perp}(M)\|_\infty \le \|M\|_\infty$ and that $\|\mathcal{P}_{T^\perp}(M)\|_2 \le
\|M\|_2$. Thus for any matrices $M, N$ and for $\mathcal{Y} = \Omega \times T$, one can check that $g_\gamma(\mathcal{P}_{\mathcal{Y}}(M, N)) \le
2 g_\gamma(M, N)$ and that $g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}(M, N)) \le g_\gamma(M, N)$.

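Next we define the quantity $\chi(\Omega, T, \gamma)$ as follows in order to study the transversality of the
spaces $\Omega$ and $T$ with respect to the $g_\gamma$ norm:

$$\chi(\Omega, T, \gamma) \triangleq \max\left\{\frac{\xi(T)}{\gamma},\ 2\mu(\Omega)\gamma\right\}. \tag{3.5}$$

Here $\mu$ and $\xi$ are defined in (3.4) and (3.1). We then have the following result (proved in
Appendix C):

Lemma 3.5. Let $S \in \Omega$, $L \in T$ be matrices such that $\|S\|_\infty = \gamma$ and let $\|L\|_2 = 1$. Then we
have that $g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \in [1 - \chi(\Omega, T, \gamma),\ 1 + \chi(\Omega, T, \gamma)]$, where $\mathcal{Y} = \Omega \times T$ and $\chi(\Omega, T, \gamma)$ is
defined in (3.5). In particular we have that $1 - \chi(\Omega, T, \gamma) \le \epsilon(\Omega, T, g_\gamma)$.

The quantity $\chi(\Omega, T, \gamma)$ being small implies that the addition operator is essentially isometric
when restricted to $\mathcal{Y} = \Omega \times T$. Stated differently the magnitude of $\chi(\Omega, T, \gamma)$ is a measure of the
level of transversality of the spaces $\Omega$ and $T$. If $\mu(\Omega)\xi(T) < \frac{1}{2}$ then $\gamma \in \left(\xi(T), \frac{1}{2\mu(\Omega)}\right)$ ensures that
$\chi(\Omega, T, \gamma) < 1$, which in turn implies that the tangent spaces $\Omega$ and $T$ have a transverse intersection.

Observation: Thus we have that the smaller the quantities $\mu(\Omega)$ and $\xi(T)$, the more transverse
the intersection of the spaces $\Omega$ and $T$.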
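
Transversality can be probed numerically in the Frobenius-norm case mentioned above, where the minimum gain is the squared smallest singular value of the addition operator restricted to the product of the two tangent spaces. The sketch below is a minimal illustration under that choice of norm (not the $g_\gamma$ norm used in the analysis); the helper names and the two rank-one test matrices are assumptions.

```python
import numpy as np

def sparse_tangent_basis(support):
    """Orthonormal basis (columns, row-major vectorization) of Omega:
    all matrices whose nonzeros lie inside the boolean mask `support`."""
    idx = np.flatnonzero(support.ravel())
    B = np.zeros((support.size, idx.size))
    B[idx, np.arange(idx.size)] = 1.0
    return B

def lowrank_tangent_basis(L, r):
    """Orthonormal basis of T(L), from the projector P_U N + N P_V - P_U N P_V."""
    p = L.shape[0]
    U, _, Vt = np.linalg.svd(L)
    P_U = U[:, :r] @ U[:, :r].T
    P_V = Vt[:r].T @ Vt[:r]
    I = np.eye(p)
    # Row-major identity vec(A N B) = (A kron B^T) vec(N); P_V is symmetric.
    P = np.kron(P_U, I) + np.kron(I, P_V) - np.kron(P_U, P_V)
    w, Q = np.linalg.eigh(P)
    return Q[:, w > 0.5]                        # eigenvectors with eigenvalue 1

def frobenius_min_gain(support, L, r):
    """Squared smallest singular value of (S, L) -> S + L restricted to Omega x T."""
    A = np.hstack([sparse_tangent_basis(support), lowrank_tangent_basis(L, r)])
    return np.linalg.svd(A, compute_uv=False).min() ** 2

p = 4
diag_support = np.eye(p, dtype=bool)            # sparse part supported on the diagonal
L_diffuse = np.ones((p, p)) / p                 # rank one, spread over all entries
L_spiky = np.zeros((p, p))
L_spiky[0, 0] = 1.0                             # rank one, but also sparse

eps_diffuse = frobenius_min_gain(diag_support, L_diffuse, 1)
eps_spiky = frobenius_min_gain(diag_support, L_spiky, 1)
print(eps_diffuse, eps_spiky)
```

For the diffuse matrix the gain is bounded away from zero, so the two tangent spaces intersect only at the origin. For the spiky matrix the gain vanishes (up to rounding), because the matrix with a single nonzero in position $(1,1)$ belongs to both tangent spaces.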

3.4 Conditions on Fisher information 

The main focus of Section 4 is to analyze the regularized maximum-likelihood convex program 
(1.1) by studying its optimality conditions. The log-likelihood function is well-approximated in a 
neighborhood by a quadratic form given by the Fisher information (which measures the curvature, 
as discussed in Section 2.3). Let $\mathcal{I}^* = \mathcal{I}(\tilde{K}_O^*)$ denote the Fisher information evaluated at the true
marginal concentration matrix $\tilde{K}_O^* = K_O^* - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, where $K_{(O\,H)}^*$ represents the
concentration matrix of the full model (see equation (2.2)). The appropriate measure of transversality
between the tangent spaces² $\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ is then in a space in
which the inner-product is given by $\mathcal{I}^*$. Specifically, we need to analyze the minimum gain of the
operator $\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}$ restricted to the space $\mathcal{Y} = \Omega \times T$. Therefore we impose several conditions
on the Fisher information $\mathcal{I}^*$. We define quantities that control the gains of $\mathcal{I}^*$ restricted to $\Omega$ and
$T$ separately; these ensure that elements of $\Omega$ and elements of $T$ are individually identifiable under
the map $\mathcal{I}^*$. In addition we define quantities that, in conjunction with bounds on $\mu(\Omega)$ and $\xi(T)$,
allow us to control the gain of $\mathcal{I}^*$ restricted to the direct-sum $\Omega \oplus T$.

2 We implicitly assume that these tangent spaces are subspaces of the space of symmetric matrices.

$\mathcal{I}^*$ restricted to $\Omega$: The minimum gain of the operator $\mathcal{P}_\Omega \mathcal{I}^* \mathcal{P}_\Omega$ restricted to $\Omega$ is given by

$$\alpha_\Omega \triangleq \min_{M \in \Omega,\ \|M\|_\infty = 1} \|\mathcal{P}_\Omega \mathcal{I}^* \mathcal{P}_\Omega(M)\|_\infty.$$

The maximum effect of elements in $\Omega$ in the orthogonal direction $\Omega^\perp$ is given by

$$\delta_\Omega \triangleq \max_{M \in \Omega,\ \|M\|_\infty = 1} \|\mathcal{P}_{\Omega^\perp} \mathcal{I}^* \mathcal{P}_\Omega(M)\|_\infty.$$

The operator $\mathcal{I}^*$ is injective on $\Omega$ if $\alpha_\Omega > 0$. The ratio $\frac{\delta_\Omega}{\alpha_\Omega} \le 1 - \nu$ implies the irrepresentability
condition imposed in [29], which gives a sufficient condition for consistent recovery of graphical model
structure using $\ell_1$-regularized maximum-likelihood. Notice that this condition is a generalization
of the usual Lasso irrepresentability conditions [40], which are typically imposed on the covariance
matrix. Finally we also consider the following quantity, which controls the behavior of $\mathcal{I}^*$ restricted
to $\Omega$ in the spectral norm:

$$\beta_\Omega \triangleq \max_{M \in \Omega,\ \|M\|_2 = 1} \|\mathcal{I}^*(M)\|_2.$$
$\mathcal{I}^*$ restricted to $T$: Analogous to the case of $\Omega$ one could control the gains of the operators
$\mathcal{P}_{T^\perp} \mathcal{I}^* \mathcal{P}_T$ and $\mathcal{P}_T \mathcal{I}^* \mathcal{P}_T$. However as discussed previously one complication is that the tangent
spaces at nearby smooth points on the rank variety are in general different, and the amount of
twisting between these spaces is governed by the local curvature. Therefore we control the gains of
the operators $\mathcal{P}_{{T'}^\perp} \mathcal{I}^* \mathcal{P}_{T'}$ and $\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}$ for all tangent spaces $T'$ that are "close to" the nominal $T$
(at the true underlying low-rank matrix), measured by $\rho(T, T')$ (2.9) being small. The minimum
gain of the operator $\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}$ restricted to $T'$ (close to $T$) is given by

$$\alpha_T \triangleq \min_{\rho(T', T) \le \frac{\xi(T)}{2}}\ \min_{M \in T',\ \|M\|_2 = 1} \|\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}(M)\|_2.$$

Similarly the maximum effect of elements in $T'$ in the orthogonal direction ${T'}^\perp$ (for $T'$ close to $T$)
is given by

$$\delta_T \triangleq \max_{\rho(T', T) \le \frac{\xi(T)}{2}}\ \max_{M \in T',\ \|M\|_2 = 1} \|\mathcal{P}_{{T'}^\perp} \mathcal{I}^* \mathcal{P}_{T'}(M)\|_2.$$

Implicit in the definition of $\alpha_T$ and $\delta_T$ is the fact that the outer minimum and maximum are only
taken over spaces $T'$ that are tangent spaces to the rank-variety. The operator $\mathcal{I}^*$ is injective on all
tangent spaces $T'$ such that $\rho(T', T) \le \frac{\xi(T)}{2}$ if $\alpha_T > 0$. An irrepresentability condition (analogous
to those developed for the sparse case) for tangent spaces near $T$ to the rank variety would be that
$\frac{\delta_T}{\alpha_T} \le 1 - \nu$. Finally we also control the behavior of $\mathcal{I}^*$ restricted to $T'$ close to $T$ in the $\ell_\infty$ norm:

$$\beta_T \triangleq \max_{\rho(T', T) \le \frac{\xi(T)}{2}}\ \max_{M \in T',\ \|M\|_\infty = 1} \|\mathcal{I}^*(M)\|_\infty.$$

The two sets of quantities $(\alpha_\Omega, \delta_\Omega)$ and $(\alpha_T, \delta_T)$ essentially control how $\mathcal{I}^*$ behaves when
restricted to the spaces $\Omega$ and $T$ separately (in the natural norms). The quantities $\beta_\Omega$ and $\beta_T$ are
useful in order to control the gains of the operator $\mathcal{I}^*$ restricted to the direct sum $\Omega \oplus T$. Notice
that although the magnitudes of elements in $\Omega$ are measured most naturally in the $\ell_\infty$ norm, the
quantity $\beta_\Omega$ is specified with respect to the spectral norm. Similarly elements of the tangent spaces
$T'$ to the rank variety are most naturally measured in the spectral norm, but $\beta_T$ provides control in
the $\ell_\infty$ norm. These quantities, combined with $\mu(\Omega)$ and $\xi(T)$ (defined in (3.4) and (3.1)), provide
the "coupling" necessary to control the behavior of $\mathcal{I}^*$ restricted to elements in the direct sum
$\Omega \oplus T$. In order to keep track of fewer quantities, we summarize the six quantities as follows:

$$\alpha \triangleq \min(\alpha_\Omega, \alpha_T), \qquad \delta \triangleq \max(\delta_\Omega, \delta_T), \qquad \beta \triangleq \max(\beta_\Omega, \beta_T).$$

Main assumption: There exists a $\nu \in (0, \frac{1}{2}]$ such that:

$$\frac{\delta}{\alpha} \le 1 - 2\nu.$$

This assumption is to be viewed as a generalization of the irrepresentability conditions imposed
on the covariance matrix [40] or the Fisher information matrix [29] in order to provide consistency
guarantees for sparse model selection using the $\ell_1$ norm. With this assumption we have the following
proposition, proved in Appendix C, about the gains of the operator $\mathcal{I}^*$ restricted to $\Omega \oplus T$. This
proposition plays a fundamental role in the analysis of the performance of the regularized
maximum-likelihood procedure (1.1).

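Proposition 3.6. Let $\Omega$ and $T$ be the tangent spaces defined in this section, and let $\mathcal{I}^*$ be the
Fisher information evaluated at the true marginal concentration matrix. Further let $\alpha, \beta, \nu$ be as
defined above. Suppose that

$$\mu(\Omega)\,\xi(T) \le \frac{1}{6}\left(\frac{\nu\alpha}{\beta(2 - \nu)}\right)^2,$$

and that $\gamma$ is in the following range:

$$\gamma \in \left[\frac{3\beta(2 - \nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)}\right].$$

Then we have the following two conclusions for $\mathcal{Y} = \Omega \times T'$ with $\rho(T', T) \le \frac{\xi(T)}{2}$:

1. The minimum gain of $\mathcal{I}^*$ restricted to $\Omega \oplus T'$ is bounded below:

$$\min_{(S,L) \in \mathcal{Y},\ \|S\|_\infty = \gamma,\ \|L\|_2 = 1} g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \ge \frac{\alpha}{2}.$$

Specifically this implies that for all $(S, L) \in \mathcal{Y}$

$$g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \ge \frac{\alpha}{2}\, g_\gamma(S, L).$$

2. The effect of elements in $\mathcal{Y} = \Omega \times T'$ on the orthogonal complement $\mathcal{Y}^\perp = \Omega^\perp \times {T'}^\perp$ is bounded
above:

$$\left\|\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}} \left(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}\right)^{-1}\right\|_{g_\gamma \to g_\gamma} \le 1 - \nu.$$

Specifically this implies that for all $(S, L) \in \mathcal{Y}$

$$g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \le (1 - \nu)\, g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^\dagger \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)).$$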
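
To make the interplay between the identifiability condition and the admissible tradeoff parameter concrete, the following sketch evaluates the interval for $\gamma$ in Proposition 3.6, reading the endpoints as $3\beta(2-\nu)\xi(T)/(\nu\alpha)$ and $\nu\alpha/(2\beta(2-\nu)\mu(\Omega))$. The function name `gamma_range` and the numerical inputs are hypothetical; the interval is nonempty exactly when $\mu(\Omega)\xi(T) \le \frac{1}{6}\left(\frac{\nu\alpha}{\beta(2-\nu)}\right)^2$.

```python
def gamma_range(alpha, beta, nu, mu, xi):
    """Admissible interval for gamma, or None when the condition on mu * xi fails."""
    c = nu * alpha / (beta * (2.0 - nu))
    if mu * xi > c ** 2 / 6.0:
        return None                       # identifiability condition violated
    return (3.0 * xi / c, c / (2.0 * mu))

# A sparse conditional graph (small mu) with a diffuse latent effect (small xi).
print(gamma_range(alpha=1.0, beta=1.0, nu=0.5, mu=1.0, xi=0.01))

# Denser graph and less diffuse latent effect: no admissible gamma.
print(gamma_range(alpha=1.0, beta=1.0, nu=0.5, mu=3.0, xi=0.1))
```

Smaller $\mu(\Omega)$ raises the upper endpoint and smaller $\xi(T)$ lowers the lower endpoint, so the interval widens as the model becomes "more identifiable".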

The last quantity we consider is the spectral norm of the marginal covariance matrix $\Sigma_O^* = (\tilde{K}_O^*)^{-1}$:

$$\psi \triangleq \|\Sigma_O^*\|_2 = \|(\tilde{K}_O^*)^{-1}\|_2. \tag{3.6}$$

A bound on $\psi$ is useful in the probabilistic component of our analysis, in order to derive convergence
rates of the sample covariance matrix to the true covariance matrix. We also observe that

4 Regularized Maximum-Likelihood Convex Program and Consistency

4.1 Setup 

Let $K_{(O\,H)}^*$ denote the full concentration matrix of a collection of zero-mean jointly-Gaussian
observed and latent variables, let $p = |O|$ denote the number of observed variables, and let $h = |H|$
denote the number of latent variables. We are given $n$ samples $\{X_O^i\}_{i=1}^n$ of the observed
variables $X_O$. We consider the high-dimensional setting in which $(p, h, n)$ are all allowed to grow
simultaneously. The quantities $\alpha, \beta, \nu, \psi$ defined in the previous section are accounted for in our
analysis, although we suppress the dependence on these quantities in the statement of our main
result. We explicitly keep track of the quantities $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ as these
control the complexity of the latent-variable model given by $K_{(O\,H)}^*$. In particular $\mu$ controls
the sparsity of the conditional graphical model among the observed variables, while $\xi$ controls the
incoherence or "diffusivity" of the extra correlations induced due to marginalization over the hidden 
variables. Based on the tradeoff between these two quantities, we obtain a number of classes of 
latent-variable graphical models (and corresponding scalings of (p,h,n)) that can be consistently 
recovered using the regularized maximum-likelihood convex program (1.1) (see Section 4.3 for 
details). Specifically we show that consistent model selection is possible even when the number 
of samples and the number of latent variables are on the same order as the number of observed 
variables. We present our main result next demonstrating the consistency of the estimator (1.1), 
and then discuss classes of latent-variable graphical models and various scaling regimes in which 
our estimator is consistent. 

4.2 Main results 

Given $n$ samples $\{X_O^i\}_{i=1}^n$ of the observed variables $X_O$, the sample covariance is defined as:

$$\Sigma_O^n \triangleq \frac{1}{n} \sum_{i=1}^n X_O^i (X_O^i)^T.$$

As discussed in Section 2.2 the goal is to produce an estimate given by a pair of matrices $(S, L)$
of the latent-variable model represented by $K_{(O\,H)}^*$. We study the consistency properties of the
following regularized maximum-likelihood convex program:

$$(\hat{S}_n, \hat{L}_n) = \arg\min_{S, L}\ \mathrm{tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \mathrm{tr}(L)] \quad \text{s.t.} \quad S - L \succ 0,\ L \succeq 0. \tag{4.1}$$

Here $\lambda_n$ is a regularization parameter, and $\gamma$ is a tradeoff parameter between the rank and sparsity
terms. Notice from Proposition 3.6 that the choice of $\gamma$ depends on the values of $\mu(\Omega(K_O^*))$ and
$\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$; essentially these quantities correspond to the degree of the conditional
graphical model structure of the observed variables and the incoherence of the low-rank matrix
summarizing the effect of the latent variables (see Section 3). While these quantities may not
be known a priori, we discuss a method to choose $\gamma$ numerically in our experimental results (see
Section 5). The following theorem shows that the estimates $(\hat{S}_n, \hat{L}_n)$ provided by the convex
program (4.1) are consistent for a suitable choice of $\lambda_n$. In addition to the appropriate identifiability
conditions (as specified by Proposition 3.6), we also impose lower bounds on the minimum nonzero
entry of the sparse conditional graphical model matrix $K_O^*$ and on the minimum nonzero singular
value of the low-rank matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ summarizing the effect of the hidden variables.
We suppress the dependence³ on $\alpha, \beta, \nu, \psi$ as we assume that these quantities remain bounded
and do not scale with the other parameters. We emphasize the dependence on $\mu(\Omega(K_O^*))$ and
$\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ because these control the complexity of the underlying latent-variable
graphical model as discussed above.
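
The objective of the convex program (4.1) is simple to evaluate for a candidate pair $(S, L)$. The sketch below is only an evaluation routine under the reading of (4.1) given above (trace term, log-determinant term, and the penalty $\lambda_n[\gamma\|S\|_1 + \mathrm{tr}(L)]$); it is not a solver, and the function names and the tolerance used in the feasibility check are assumptions. Both $S$ and $L$ are assumed symmetric.

```python
import numpy as np

def sample_covariance(X):
    """X holds one zero-mean sample per row; returns (1/n) * sum_i x_i x_i^T."""
    return X.T @ X / X.shape[0]

def objective(S, L, Sigma_n, lam, gamma):
    """Value of the objective in (4.1), or +inf outside the feasible set."""
    K = S - L
    if np.linalg.eigvalsh(L).min() < -1e-10:       # L must be positive semidefinite
        return np.inf
    if np.linalg.eigvalsh(K).min() <= 0:           # S - L must be positive definite
        return np.inf
    _, logdet = np.linalg.slogdet(K)
    penalty = gamma * np.abs(S).sum() + np.trace(L)
    return np.trace(K @ Sigma_n) - logdet + lam * penalty

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
Sigma_n = sample_covariance(X)                     # equals 0.5 * identity here

I3 = np.eye(3)
value = objective(S=I3, L=np.zeros((3, 3)), Sigma_n=I3, lam=0.1, gamma=2.0)
infeasible = objective(S=I3, L=2 * I3, Sigma_n=I3, lam=0.1, gamma=2.0)
print(value, infeasible)
```

With $S = I$, $L = 0$ and identity sample covariance in dimension 3, the trace term is 3, the log-determinant vanishes, and the penalty contributes $0.1 \cdot 2 \cdot 3$.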

Theorem 4.1. Let $K_{(O\,H)}^*$ denote the concentration matrix of a Gaussian model. We have $n$ samples
$\{X_O^i\}_{i=1}^n$ of the $p$ observed variables denoted by $O$. Let $\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$
denote the tangent spaces at $K_O^*$ and at $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ with respect to the sparse and low-rank
matrix varieties respectively.

Assumptions: Suppose that the following conditions hold:

1. The quantities $\mu(\Omega)$ and $\xi(T)$ satisfy the assumption of Proposition 3.6 for identifiability, and
$\gamma$ is chosen in the range specified by Proposition 3.6.

2. The number of samples $n$ available is such that

$$n \gtrsim \frac{p}{\xi(T)^4}.$$

3. The regularization parameter $\lambda_n$ is chosen as

$$\lambda_n \asymp \frac{1}{\xi(T)} \sqrt{\frac{p}{n}}.$$

4. The minimum nonzero singular value $\sigma$ of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is bounded as

$$\sigma \gtrsim \frac{1}{\xi(T)^3} \sqrt{\frac{p}{n}}.$$

5. The minimum magnitude nonzero entry $\theta$ of $K_O^*$ is bounded as

$$\theta \gtrsim \frac{1}{\xi(T)\mu(\Omega)} \sqrt{\frac{p}{n}}.$$

Conclusions: Then with probability greater than $1 - 2\exp\{-p\}$ we have:

1. Algebraic consistency: The estimate $(\hat{S}_n, \hat{L}_n)$ given by the convex program (4.1) is algebraically
consistent, i.e., the support and sign pattern of $\hat{S}_n$ is the same as that of $K_O^*$, and the rank
of $\hat{L}_n$ is the same as that of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$.

2. Parametric consistency: The estimate $(\hat{S}_n, \hat{L}_n)$ given by the convex program (4.1) is
parametrically consistent:

$$g_\gamma(\hat{S}_n - K_O^*,\ \hat{L}_n - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*) \lesssim \frac{1}{\xi(T)} \sqrt{\frac{p}{n}}.$$

3 We use the notation $a \gtrsim b$ if there exists a function $r(\alpha, \beta, \nu, \psi)$ such that $a \ge r(\alpha, \beta, \nu, \psi)\, b$. Similarly we use the
notation $a \asymp b$ if there exists a function $r(\alpha, \beta, \nu, \psi)$ such that $a = r(\alpha, \beta, \nu, \psi)\, b$.

The proof of this theorem is given in Appendix D. The theorem essentially states that if the
minimum nonzero singular value of the low-rank piece $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ and minimum nonzero
entry of the sparse piece $K_O^*$ are bounded away from zero, then the convex program (4.1) provides
estimates that are both algebraically consistent and parametrically consistent (in the $\ell_\infty$ and
spectral norms). In Section 4.4 we also show that these results easily lead to parametric consistency
rates for the corresponding estimate $(\hat{S}_n - \hat{L}_n)^{-1}$ of the marginal covariance $\Sigma_O^*$ of the observed
variables.

Notice that the condition on the minimum singular value of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is more stringent
than on the minimum nonzero entry of $K_O^*$. One role played by these conditions is to ensure that the
estimates $(\hat{S}_n, \hat{L}_n)$ do not have smaller support size/rank than $(K_O^*, K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$. However
the minimum singular value bound plays the additional role of bounding the curvature of the
low-rank matrix variety around the point $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, which is the reason for this condition being
more stringent. Notice also that the number of hidden variables $h$ does not explicitly appear in the
bounds in Theorem 4.1, which only depend on $p$, $\mu(\Omega(K_O^*))$, $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$. However
the dependence on $h$ is implicit in the dependence on $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$, and we discuss this
point in greater detail in the following section.

Finally we note that algebraic and parametric consistency hold under the assumptions of
Theorem 4.1 for a range of values of $\gamma$:

$$\gamma \in \left[\frac{3\beta(2 - \nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)}\right].$$

In particular the assumptions on the sample complexity, the minimum nonzero singular value of
$K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, and the minimum magnitude nonzero entry of $K_O^*$ are governed by the lower
end of this range for $\gamma$. These assumptions can be weakened if we only require consistency for a
smaller range of values of $\gamma$. The following corollary conveys this point with a specific example:

Corollary 4.2. Consider the same setup and notation as in Theorem 4.1. Suppose that the
quantities $\mu(\Omega)$ and $\xi(T)$ satisfy the assumption of Proposition 3.6 for identifiability. Suppose that we
make the following assumptions:

1. Let $\gamma$ be chosen to be equal to $\frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)}$ (the upper end of the range specified in
Proposition 3.6), i.e., $\gamma \asymp \frac{1}{\mu(\Omega)}$.

2. $n \gtrsim p$.

3. $\lambda_n \asymp \mu(\Omega) \sqrt{\frac{p}{n}}$.

4. $\sigma \gtrsim \frac{\mu(\Omega)^2}{\xi(T)} \sqrt{\frac{p}{n}}$.

5. $\theta \gtrsim \sqrt{\frac{p}{n}}$.

Then with probability greater than $1 - 2\exp\{-p\}$ we have estimates $(\hat{S}_n, \hat{L}_n)$ that are algebraically
consistent, and parametrically consistent with the error bounded as

$$g_\gamma(\hat{S}_n - K_O^*,\ \hat{L}_n - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*) \lesssim \mu(\Omega) \sqrt{\frac{p}{n}}.$$

The proof of this corollary⁴ is analogous to that of Theorem 4.1. We emphasize that in practice
it is often beneficial to have consistent estimates for a range of values of $\gamma$ (as in Theorem 4.1).
Specifically the stability of the sparsity pattern and rank of the estimates $(\hat{S}_n, \hat{L}_n)$ for a range of
tradeoff parameters is useful in order to choose a suitable value of $\gamma$, as prior information about
the quantities $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ is not typically available (see Section 5).

4.3 Scaling regimes 

Next we consider classes of latent-variable models that satisfy the conditions of Theorem 4.1.
Recall that $n$ denotes the number of samples, $p$ denotes the number of observed variables, and
$h$ denotes the number of latent variables. Recall the assumption that the quantities $\alpha, \beta, \nu, \psi$
defined in Section 3.4 remain bounded, and do not scale with the other parameters such as
$(p, h, n)$ or $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ or $\mu(\Omega(K_O^*))$. In particular we focus on the tradeoff
between $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ and $\mu(\Omega(K_O^*))$ (the quantities that control the complexity of a
latent-variable graphical model), and the resulting scaling regimes for consistent estimation. Let
$d = \deg(K_O^*)$ denote the degree of the conditional graphical model among the observed variables,
and let $i = \mathrm{inc}(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ denote the incoherence of the correlations induced due to
marginalization over the latent variables (we suppress the dependence on $n$). These quantities are
defined in Section 3, and we have from Propositions 3.1 and 3.3 that

$$\mu(\Omega(K_O^*)) \le d, \qquad \xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)) \le 2i.$$

Since $\alpha, \beta, \nu, \psi$ are assumed to be bounded, we also have from Proposition 3.6 that the product of
$\mu$ and $\xi$ must be bounded by a constant. Thus, we study latent-variable models in which

$$d\, i = O(1).$$

As we describe next, there are non-trivial classes of latent-variable graphical models in which this
condition holds.

Bounded degree and incoherence: The first class of latent-variable models that we consider
are those in which the conditional graphical model among the observed variables (given by $K_O^*$)
has constant degree $d$. Recall from equation (3.3) that the incoherence $i$ of the effect of the latent
variables (given by $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$) can be as small as $\sqrt{h/p}$. Consequently latent-variable models
in which

$$d = O(1), \qquad h \sim p,$$

can be estimated consistently from $n \sim p$ samples as long as the low-rank matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$
is almost maximally incoherent, i.e., $i \sim \sqrt{h/p}$, so the effect of marginalization over the latent variables
is diffuse across almost all the observed variables. Thus consistent latent-variable model selection
is possible even when the number of samples and the number of latent variables are on the same
order as the number of observed variables.

Polylogarithmic degree: The next class of models that we study are those in which the degree
$d$ of the conditional graphical model of the observed variables grows poly-logarithmically with $p$.
Consequently, the incoherence $i$ of the matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ must decay as the inverse of
poly-log($p$). Using the fact that maximally incoherent low-rank matrices $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ can have
incoherence as small as $\sqrt{h/p}$, latent-variable models in which

$$d \sim \log(p)^q, \qquad h \sim \frac{p}{\log(p)^{2q}},$$

can be consistently estimated as long as $n \sim p\ \text{poly-log}(p)$.

4 By making stronger assumptions on the Fisher information matrix $\mathcal{I}^*$, one can further remove the factor of $\xi(T)$
in the lower bound for $\sigma$. Specifically the lower bound $\sigma \gtrsim \mu(\Omega)^3 \sqrt{\frac{p}{n}}$ suffices for consistent estimation if $\alpha_T, \beta_T$
bound the minimum/maximum gains of $\mathcal{I}^*$ for all matrices (rather than just those near $T$), and $\delta_T$ bounds the
$\mathcal{I}^*$-inner-product for all pairs of orthogonal matrices (rather than just those near $T$ and $T^\perp$).

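
The arithmetic behind the polylogarithmic regime can be spelled out in a few lines. The sketch below takes the scalings $d \sim \log(p)^q$ and $h \sim p / \log(p)^{2q}$ with all constants set to one (an assumption made purely for illustration), uses the best-case incoherence $i = \sqrt{h/p}$, and confirms that the product $d\,i$ stays bounded as $p$ grows. The function name `polylog_regime` is hypothetical.

```python
import math

def polylog_regime(p, q):
    """Degree d, latent count h, and best-case incoherence i = sqrt(h / p)."""
    d = math.log(p) ** q
    h = p / math.log(p) ** (2 * q)
    i = math.sqrt(h / p)
    return d, h, i

for p in (10 ** 3, 10 ** 6):
    d, h, i = polylog_regime(p, q=2)
    print(p, round(d, 2), round(h, 2), d * i)    # the product d * i does not grow with p
```

The degree grows and the number of latent variables grows almost linearly in $p$, yet the product $d\,i$ remains constant, which is the condition $d\,i = O(1)$ required above.
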
4.4 Rates for covariance matrix estimation

The main result Theorem 4.1 gives conditions under which we can consistently estimate the sparse
and low-rank parts that compose the marginal concentration matrix $\tilde{K}_O^*$. Here we prove a corollary
that gives rates for covariance matrix estimation, i.e., the quality of the estimate $(\hat{S}_n - \hat{L}_n)^{-1}$ with
respect to the "true" marginal covariance matrix $\Sigma_O^*$.

Corollary 4.3. Under the same conditions as in Theorem 4.1, we have with probability greater
than $1 - 2\exp\{-p\}$ that

$$g_\gamma(\mathcal{A}^\dagger[(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^*]) \lesssim \frac{1}{\xi(T)} \sqrt{\frac{p}{n}}.$$

Specifically this implies that $\|(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^*\|_2 \lesssim \frac{1}{\xi(T)} \sqrt{\frac{p}{n}}$.

Proof: The proof of this corollary follows directly from duality. Based on the analysis in
Appendix D (in particular using the optimality conditions of the modified convex program (D.8)), we
have that

$$g_\gamma(\mathcal{A}^\dagger[(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^n]) \le \lambda_n.$$

We also have from the bound on the number of samples $n$ that with probability greater than
$1 - 2\exp\{-p\}$ (see Appendix D.7)

$$g_\gamma(\mathcal{A}^\dagger[\Sigma_O^* - \Sigma_O^n]) \lesssim \lambda_n.$$

Based on the choice of $\lambda_n$ in Theorem 4.1, we then have the desired bound. □



4.5 Proof strategy for Theorem 4.1 

Standard results from convex analysis [31] state that $(\hat{S}_n, \hat{L}_n)$ is a minimum of the convex program
(4.1) if the zero matrix belongs to the subdifferential of the objective function evaluated at $(\hat{S}_n, \hat{L}_n)$
(in addition to $(\hat{S}_n, \hat{L}_n)$ satisfying the constraints). The subdifferential of the $\ell_1$ norm at a matrix
$M$ is given by

$$N \in \partial \|M\|_1 \iff \mathcal{P}_{\Omega(M)}(N) = \mathrm{sign}(M), \quad \|\mathcal{P}_{\Omega(M)^\perp}(N)\|_\infty \le 1.$$

For a symmetric positive semidefinite matrix $M$ with SVD $M = U D U^T$, the subdifferential of the
trace function restricted to the cone of positive semidefinite matrices (i.e., the nuclear norm over
this set) is given by:

$$N \in \partial[\mathrm{tr}(M) + \mathbb{I}_{M \succeq 0}] \iff \mathcal{P}_{T(M)}(N) = U U^T, \quad \mathcal{P}_{T(M)^\perp}(N) \preceq I,$$

where $\mathbb{I}_{M \succeq 0}$ denotes the characteristic function of the set of positive semidefinite matrices (i.e.,
the convex function that evaluates to $0$ over this set and $\infty$ outside). The key point is that
elements of the subdifferential decompose with respect to the tangent spaces $\Omega(M)$ and $T(M)$.
This decomposition property plays a critical role in our analysis. In particular it states that the
optimality conditions consist of two parts, one part corresponding to the tangent spaces $\Omega$ and $T$
and another corresponding to the normal spaces $\Omega^\perp$ and $T^\perp$.

Consider the optimization problem (4.1) with the additional (non-convex) constraints that the
variable $S$ belongs to the algebraic variety of sparse matrices and that the variable $L$ belongs to
the algebraic variety of low-rank matrices. While this new optimization problem is non-convex,
it has a very interesting property. At a globally optimal solution (and indeed at any locally
optimal solution) $(S, L)$ such that $S$ and $L$ are smooth points of the algebraic varieties of sparse and
low-rank matrices, the first-order optimality conditions state that the Lagrange multipliers
corresponding to the additional variety constraints must lie in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$.
This fundamental observation, combined with the decomposition property of the subdifferentials
of the $\ell_1$ and nuclear norms, suggests the following high-level proof strategy:

1. Let $(S, L)$ be the globally optimal solution of the optimization problem (4.1) with the
additional constraints that $(S, L)$ belong to the algebraic varieties of sparse/low-rank matrices;
specifically constrain $S$ to lie in $\mathcal{S}(|\mathrm{support}(K_O^*)|)$ and constrain $L$ to lie in
$\mathcal{L}(\mathrm{rank}(K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*))$. Show first that $(S, L)$ are smooth points of these varieties.

2. The first part of the subgradient optimality conditions of the original convex program (4.1)
corresponding to components on the tangent spaces $\Omega(S)$ and $T(L)$ is satisfied. This
conclusion can be reached because the additional Lagrange multipliers due to the variety constraints
lie in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$.

3. Finally show that the second part of the subgradient optimality conditions of (4.1) (without
any variety constraints) corresponding to components in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$
is also satisfied by $(S, L)$.

Combining these steps together we show that $(S, L)$ satisfy the optimality conditions of the
original convex program (4.1). Consequently $(S, L)$ is also the optimum of the convex program
(4.1). As this estimate is also the solution to the problem with the variety constraints, the algebraic
consistency of $(S, L)$ can be directly concluded. We emphasize here that the variety-constrained
optimization problem is used solely as an analysis tool in order to prove consistency of the estimates
provided by the convex program (4.1). These steps describe our broad strategy, and we refer the
reader to Appendix D for details. The key technical complication is that the tangent spaces at
$L$ and $K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$ are in general different. We bound the twisting between these tangent
spaces by using the fact that the minimum nonzero singular value of $K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$ is bounded
away from zero (as assumed in Theorem 4.1 and using Proposition 2.1).

5 Simulation Results 

In this section we give experimental demonstration of the consistency of our estimator (4.1) on
synthetic examples, and its effectiveness in modeling real-world stock return data. Our choices of
$\lambda_n$ and $\gamma$ are guided by Theorem 4.1. Specifically, we choose $\lambda_n$ to be proportional to $\sqrt{p/n}$. For $\gamma$
we observe that the support/sign-pattern and the rank of the solution $(\hat{S}_n, \hat{L}_n)$ are the same for a
range of values of $\gamma$. Therefore one could solve the convex program (4.1) for several values of $\gamma$, and
choose a solution in a suitable range in which the sign-pattern and rank of the solution are stable.
In practical problems with real-world data these parameters may be chosen via cross-validation.

Figure 1: Synthetic data: Plot showing probability of consistent estimation of the number of latent
variables, and the conditional graphical model structure of the observed variables. The three models
studied are (a) 36-node conditional graphical model given by a cycle with h = 2 latent variables, (b)
36-node conditional graphical model given by a cycle with h = 3 latent variables, and (c) 36-node
conditional graphical model given by a 6 x 6 grid with h = 1 latent variable. For each plotted
point, the probability of consistent estimation is obtained over 50 random trials.
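
The stability heuristic for $\gamma$ described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `stable_gamma_range`, the hashable "signature" summarizing the sign-pattern and rank of a solution, and the sweep values are all hypothetical and not part of the original experiments.

```python
from itertools import groupby

def stable_gamma_range(results):
    """Return the longest run of consecutive gamma values whose fitted
    (sign-pattern, rank) summary is identical.

    `results` is a list of (gamma, signature) pairs sorted by gamma, where
    `signature` is any hashable summary of the solution, e.g.
    (tuple of off-diagonal signs of S, rank of L).
    """
    best = []
    # groupby merges only *adjacent* entries, which is what "stable over a
    # range of gamma" requires.
    for _, run in groupby(results, key=lambda pair: pair[1]):
        run = list(run)
        if len(run) > len(best):
            best = run
    return [gamma for gamma, _ in best]

# Hypothetical summaries for five gamma values (made up for illustration).
sweep = [(0.1, ("A", 3)), (0.2, ("B", 2)), (0.3, ("B", 2)),
         (0.4, ("B", 2)), (0.5, ("C", 1))]
stable = stable_gamma_range(sweep)
```

Any value of $\gamma$ inside the returned run would then be a candidate choice; in practice the signature would be computed from the solver output for each $\gamma$.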



For small problem instances we solve the convex program (4.1) using a combination of YALMIP [25] 
and SDPT3 [34], which are standard off-the-shelf packages for solving convex programs. For larger 
problem instances we use the special purpose solver LogdetPPA [36] developed for log-determinant 
semidefinite programs. 

5.1 Synthetic data 

In the first set of experiments we consider a setting in which we have access to samples of the 
observed variables of a latent-variable graphical model. We consider several latent-variable Gaussian
graphical models. The first model consists of p = 36 observed variables and h = 2 hidden variables. 
The conditional graphical model structure of the observed variables is a cycle with the edge partial 
correlation coefficients equal to 0.25; thus, this conditional model is specified by a sparse graphical 
model with degree 2. The second model is the same as the first one, but with h = 3 latent variables. 
The third model consists of h = 1 latent variable, and the conditional graphical model structure 
of the observed variables is given by a 6 x 6 nearest-neighbor grid (i.e., p = 36 and degree 4) 
with the partial correlation coefficients of the edges equal to 0.15. In all three of these models 
each latent variable is connected to a random subset of 80% of the observed variables (and the 
partial correlation coefficients corresponding to these edges are also random). Therefore the effect 
of the latent variables is "spread out" over most of the observed variables, i.e., the low-rank matrix 
summarizing the effect of the latent variables is incoherent. 
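
As a rough illustration of the first synthetic model, the following NumPy sketch builds a 36-node cycle with edge partial correlations 0.25 and h = 2 latent variables, each connected to a random 80% of the observed variables. The normalization of the latent couplings (identity latent block, spectral norm 0.5), the random seeds, and the sample size are assumptions made here so that the joint concentration matrix is positive definite; the source does not specify them.

```python
import numpy as np

def cycle_latent_model(p=36, h=2, rho=0.25, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    # Conditional concentration K_O: unit diagonal, -rho on cycle edges,
    # so every edge has partial correlation rho.
    K_O = np.eye(p)
    for i in range(p):
        j = (i + 1) % p
        K_O[i, j] = K_O[j, i] = -rho
    # Each latent variable touches a random subset of the observed variables.
    K_OH = np.zeros((p, h))
    for k in range(h):
        idx = rng.choice(p, size=int(round(frac * p)), replace=False)
        K_OH[idx, k] = rng.standard_normal(idx.size)
    # Assumed normalization: K_H = I and ||K_OH||_2 = 0.5, which keeps the
    # Schur complement K_O - K_OH K_OH^T positive definite.
    K_OH *= 0.5 / np.linalg.norm(K_OH, 2)
    L = K_OH @ K_OH.T              # low-rank term K_OH K_H^{-1} K_HO
    return K_O, L

K_O, L = cycle_latent_model()
K_marg = K_O - L                            # marginal concentration matrix
Sigma_O = np.linalg.inv(K_marg)
Sigma_O = (Sigma_O + Sigma_O.T) / 2         # remove round-off asymmetry
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(36), Sigma_O, size=500)
Sigma_n = X.T @ X / X.shape[0]              # sample covariance of observed variables
```

The matrix `Sigma_n` plays the role of the sample covariance that is passed to the convex program, while `K_O` and `L` are the sparse and low-rank parts one hopes to recover.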

For each model we generate $n$ samples of the observed variables, and use the resulting sample
covariance matrix $\Sigma_O^n$ as input to our convex program (4.1). Figure 1 shows the probability of
recovery of the support/sign-pattern of the conditional graphical model structure in the observed
variables and the number of latent variables (i.e., probability of obtaining algebraically consistent
estimates) as a function of $n$. This probability is evaluated over 50 experiments for each value of
$n$.

Figure 2: Stock returns: The figure on the left shows the sparsity pattern (black denotes an
edge, and white denotes no edge) of the concentration matrix of the conditional graphical model
(135 edges) of the stock returns, conditioned on 5 latent variables, in a latent-variable graphical
model (total number of parameters equals 639). This model is learned using (4.1), and the KL
divergence with respect to a Gaussian distribution specified by the sample covariance is 17.7. The
figure on the right shows the concentration matrix of the graphical model (646 edges) of the stock
returns, learned using standard sparse graphical model selection based on solving an $\ell_1$-regularized
maximum-likelihood program (total number of parameters equals 730). The KL divergence between
this distribution and a Gaussian distribution specified by the sample covariance is 44.4.



In all of these cases standard graphical model selection applied directly to the observed variables 
is not useful as the marginal concentration matrix of the observed variables is not well-approximated 
by a sparse matrix. These experiments agree with our theoretical results that the convex program 
(4.1) is an algebraically consistent estimator of a latent-variable model given (sufficiently many) 
samples of only the observed variables. 

5.2 Stock return data 

In the next experiment we model the statistical structure of monthly stock returns of 84 companies 
in the S&P 100 index from 1990 to 2007; we disregard 16 companies that were listed after 1990. 
The number of samples n is equal to 216. We compute the sample covariance based on these returns 
and use this as input to (4.1). 

The model learned using (4.1) for suitable values of $\lambda_n, \gamma$ consists of h = 5 latent variables,
and the conditional graphical model structure of the stock returns conditioned on these hidden 
components consists of 135 edges. Therefore the number of parameters in the model is 84 + 135 + 
(5 x 84) = 639. The resulting KL divergence between the distribution specified by this model 
and a Gaussian distribution specified by the sample covariance is 17.7. Figure 2 (left) shows the 
conditional graphical model structure. The strongest edges in this conditional graphical model, as 
measured by partial correlation, are between Baker Hughes - Schlumberger, A.T.&T. - Verizon, 
Merrill Lynch - Morgan Stanley, Halliburton - Baker Hughes, Intel - Texas Instruments, Apple - 
Dell, and Microsoft - Dell. It is of interest to note that in the Standard Industrial Classification 5 
system for grouping these companies, several of these pairs are in different classes. As mentioned 
in Section 2.2 our method estimates a low-rank matrix that summarizes the effect of the latent 
variables; in order to factorize this low-rank matrix, for example into sparse factors, one could use 
methods such as those described in [38]. 
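
The parameter count and the KL divergence quoted above can be reproduced in form (not in value) with a short sketch. The closed-form expression below is the standard KL divergence between zero-mean Gaussians; which of the two distributions is taken as the first argument in the reported numbers is not stated in the text, so the argument order here is an assumption, as are the function names.

```python
import numpy as np

def gaussian_kl(sigma_p, sigma_q):
    """KL( N(0, sigma_p) || N(0, sigma_q) ) for positive-definite inputs."""
    p = sigma_p.shape[0]
    m = np.linalg.solve(sigma_q, sigma_p)       # sigma_q^{-1} sigma_p
    _, logdet = np.linalg.slogdet(m)
    return 0.5 * (np.trace(m) - p - logdet)

def latent_model_params(p, edges, h):
    # p diagonal entries + conditional edges + p*h latent-observed couplings
    return p + edges + h * p

n_latent = latent_model_params(84, 135, 5)   # latent-variable model
n_sparse = 646 + 84                           # sparse model without latent variables
```

With the figures from the text, `n_latent` evaluates to 84 + 135 + 5 x 84 and `n_sparse` to 646 + 84, matching the counts reported for the two models.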

5 See the United States Securities and Exchange Commission website at
http://www.sec.gov/info/edgar/siccodes.htm

We compare these results to those obtained using a sparse graphical model learned using
$\ell_1$-regularized maximum-likelihood (see for example [29]), without introducing any latent variables.
Figure 2 (right) shows this graphical model structure. The number of edges in this model is 646 
(the total number of parameters is equal to 646 + 84 = 730), and the resulting KL divergence 
between this distribution and a Gaussian distribution specified by the sample covariance is 44.4. 
Indeed to obtain a comparable KL divergence to that of the latent-variable model described above, 
one would require a graphical model with over 3000 edges. 

These results suggest that a latent-variable graphical model is better suited than a standard 
sparse graphical model for modeling the statistical structure among stock returns. This is likely 
due to the presence of global, long-range correlations in stock return data that are better modeled 
via latent variables. 

6 Discussion 

We have studied the problem of modeling the statistical structure of a collection of random vari- 
ables as a sparse graphical model conditioned on a few additional hidden components. As a first 
contribution we described conditions under which such latent-variable graphical models are iden- 
tifiable given samples of only the observed variables. We also proposed a convex program based 
on regularized maximum-likelihood for latent-variable graphical model selection; the regularization 
function is a combination of the $\ell_1$ norm and the nuclear norm. Given samples of the observed
variables of a latent-variable Gaussian model we proved that this convex program provides con- 
sistent estimates of the number of hidden components as well as the conditional graphical model 
structure among the observed variables conditioned on the hidden components. Our analysis holds 
in the high-dimensional regime in which the number of observed/latent variables are allowed to 
grow with the number of samples of the observed variables. In particular we discuss certain scaling 
regimes in which consistent model selection is possible even when the number of samples and the 
number of latent variables are on the same order as the number of observed variables. These theo- 
retical predictions are verified via a set of experiments on synthetic data. We also demonstrate the 
effectiveness of our approach in modeling real-world stock return data. 

Several research questions arise that are worthy of further investigation. While the convex 
program (4.1) can be solved in polynomial time using off-the-shelf solvers, it is preferable to develop 
more efficient special-purpose solvers that can scale to massive datasets by taking advantage of 
the structure of the formulation (4.1). Finally it would be of interest to develop a similar convex 
optimization formulation with consistency guarantees for latent-variable models with non-Gaussian 
variables, e.g., for categorical data.

A Matrix Perturbation Bounds

Given a low-rank matrix we consider what happens to the invariant subspaces when the matrix
is perturbed by a small amount. We assume without loss of generality that the matrix under
consideration is square and symmetric, and our methods can be extended to the general
non-symmetric non-square case. We refer the interested reader to [2, 21] for more details, as the results
presented here are only a brief summary of what is relevant for this paper. In particular the
arguments presented here are along the lines of those presented in [2]. The appendices in [2] also
provide a more refined analysis of second-order perturbation errors.

The resolvent of a matrix $M$ is given by $(M - \zeta I)^{-1}$ [21], and it is well-defined for all $\zeta \in \mathbb{C}$
that do not coincide with an eigenvalue of $M$. If $M$ has no eigenvalue with magnitude equal to $\eta$,
then we have by the Cauchy residue formula that the projector onto the invariant subspace of a
matrix $M$ corresponding to all singular values smaller than $\eta$ is given by

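$$P_{M,\eta} = \frac{-1}{2\pi i}\oint_{\mathcal{C}_\eta} (M - \zeta I)^{-1}\, d\zeta, \tag{A.1}$$

where $\mathcal{C}_\eta$ denotes the positively-oriented circle of radius $\eta$ centered at the origin. Similarly, we have
that the weighted projection onto the invariant subspace corresponding to the smallest singular
values is given by

$$P^w_{M,\eta} = M P_{M,\eta} = \frac{-1}{2\pi i}\oint_{\mathcal{C}_\eta} \zeta\,(M - \zeta I)^{-1}\, d\zeta. \tag{A.2}$$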
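
The contour-integral representation (A.1) can be checked numerically. The sketch below discretizes the circle of radius $\eta$ with the trapezoid rule and compares the result with the projector obtained directly from an orthonormal basis; the function name `contour_projector`, the test matrix, and the number of quadrature points are illustrative choices, not taken from the paper.

```python
import numpy as np

def contour_projector(M, eta, num_points=400):
    """Approximate P_{M,eta} = -(1/(2*pi*i)) * integral of (M - zeta I)^{-1}
    over the circle |zeta| = eta, using the trapezoid rule."""
    p = M.shape[0]
    theta = 2 * np.pi * np.arange(num_points) / num_points
    P = np.zeros((p, p), dtype=complex)
    for t in theta:
        zeta = eta * np.exp(1j * t)
        # d(zeta) = i * zeta * d(theta), so the 1/(2*pi*i) prefactor reduces
        # the integral to a plain average of -zeta * resolvent.
        P += -zeta * np.linalg.inv(M - zeta * np.eye(p))
    return (P / num_points).real

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((6, 2)))[0]
M = U @ np.diag([3.0, 2.0]) @ U.T            # rank-2 symmetric matrix
P_small = contour_projector(M, eta=0.5)       # eigenvalues inside the circle: zeros only
P_exact = np.eye(6) - U @ U.T                 # projector onto the null space of M
```

Because the only eigenvalue of `M` inside the circle is zero, `P_small` should agree with `P_exact`, and the weighted projection `M @ P_small` from (A.2) should vanish up to round-off.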



Suppose that $M$ is a low-rank matrix with smallest nonzero singular value $\sigma$, and let $\Delta$ be a
perturbation of $M$ such that $\|\Delta\|_2 \le \kappa < \frac{\sigma}{2}$. We have the following identity for any $|\zeta| = \kappa$, which
will be used repeatedly:

$$[(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1} = -[M - \zeta I]^{-1}\,\Delta\,[(M + \Delta) - \zeta I]^{-1}. \tag{A.3}$$
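
For completeness, the identity (A.3) is an instance of the elementary relation between two inverses. The short derivation below uses the shorthand $A$ and $B$, introduced here only for this purpose.

```latex
% Shorthand (introduced here): A = (M + \Delta) - \zeta I, B = M - \zeta I,
% both invertible for |\zeta| = \kappa, so that B - A = -\Delta.
\begin{align*}
A^{-1} - B^{-1} &= B^{-1}(B - A)A^{-1} \\
                &= B^{-1}(-\Delta)A^{-1} \\
                &= -[M - \zeta I]^{-1}\,\Delta\,[(M + \Delta) - \zeta I]^{-1}.
\end{align*}
```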

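We then have that

$$\begin{aligned}
P_{M+\Delta,\kappa} - P_{M,\kappa} &= \frac{-1}{2\pi i}\oint_{\mathcal{C}_\kappa} \big\{[(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1}\big\}\, d\zeta \\
&= \frac{1}{2\pi i}\oint_{\mathcal{C}_\kappa} [M - \zeta I]^{-1}\,\Delta\,[(M + \Delta) - \zeta I]^{-1}\, d\zeta.
\end{aligned} \tag{A.4}$$

Similarly, we have the following for $P^w_{M,\kappa}$:

$$\begin{aligned}
P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} &= \frac{-1}{2\pi i}\oint_{\mathcal{C}_\kappa} \zeta\,\big\{[(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1}\big\}\, d\zeta \\
&= \frac{1}{2\pi i}\oint_{\mathcal{C}_\kappa} \zeta\,[M - \zeta I]^{-1}\,\Delta\,[(M + \Delta) - \zeta I]^{-1}\, d\zeta \\
&= \frac{1}{2\pi i}\oint_{\mathcal{C}_\kappa} \zeta\,[M - \zeta I]^{-1}\,\Delta\,[M - \zeta I]^{-1}\, d\zeta \\
&\quad - \frac{1}{2\pi i}\oint_{\mathcal{C}_\kappa} \zeta\,[M - \zeta I]^{-1}\,\Delta\,[M - \zeta I]^{-1}\,\Delta\,[(M + \Delta) - \zeta I]^{-1}\, d\zeta.
\end{aligned} \tag{A.5}$$

Given these expressions, we have the following two results.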

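Proposition A.1. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest nonzero singular value equal to
$\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\kappa}{2}$ with $\kappa < \frac{\sigma}{2}$. Then we have that

$$\|P_{M+\Delta,\kappa} - P_{M,\kappa}\|_2 \le \frac{\kappa}{(\sigma - \kappa)(\sigma - \frac{3\kappa}{2})}\,\|\Delta\|_2.$$

Proof: This result follows directly from the expression (A.4), and the sub-multiplicative property
of the spectral norm:

$$\begin{aligned}
\|P_{M+\Delta,\kappa} - P_{M,\kappa}\|_2 &\le \frac{1}{2\pi}\, 2\pi\kappa\, \frac{1}{\sigma - \kappa}\,\|\Delta\|_2\, \frac{1}{\sigma - \frac{3\kappa}{2}} \\
&= \frac{\kappa}{(\sigma - \kappa)(\sigma - \frac{3\kappa}{2})}\,\|\Delta\|_2.
\end{aligned}$$

Here, we used the fact that $\|[M - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \kappa}$ and $\|[(M + \Delta) - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \frac{3\kappa}{2}}$ for $|\zeta| = \kappa$. □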

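Next, we develop a similar bound for $P^w_{M,\kappa}$. Let $U(M)$ denote the invariant subspace of $M$
corresponding to the nonzero singular values, and let $P_{U(M)}$ denote the projector onto this subspace.

Proposition A.2. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest nonzero singular value equal to
$\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\kappa}{2}$ with $\kappa < \frac{\sigma}{2}$. Then we have that

$$\|P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} - (I - P_{U(M)})\,\Delta\,(I - P_{U(M)})\|_2 \le \frac{\kappa^2}{(\sigma - \kappa)^2(\sigma - \frac{3\kappa}{2})}\,\|\Delta\|_2^2.$$

Proof: One can check that

$$\frac{1}{2\pi i}\oint_{\mathcal{C}_\kappa} \zeta\,[M - \zeta I]^{-1}\,\Delta\,[M - \zeta I]^{-1}\, d\zeta = (I - P_{U(M)})\,\Delta\,(I - P_{U(M)}).$$

Next we use the expression (A.5), and the sub-multiplicative property of the spectral norm:

$$\begin{aligned}
\|P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} - (I - P_{U(M)})\,\Delta\,(I - P_{U(M)})\|_2
&\le \frac{1}{2\pi}\, 2\pi\kappa\; \kappa\; \frac{1}{\sigma - \kappa}\,\|\Delta\|_2\, \frac{1}{\sigma - \kappa}\,\|\Delta\|_2\, \frac{1}{\sigma - \frac{3\kappa}{2}} \\
&= \frac{\kappa^2}{(\sigma - \kappa)^2(\sigma - \frac{3\kappa}{2})}\,\|\Delta\|_2^2.
\end{aligned}$$

As with the previous proof, we used the fact that $\|[M - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \kappa}$ and
$\|[(M + \Delta) - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \frac{3\kappa}{2}}$ for $|\zeta| = \kappa$. □

We will use these expressions to derive bounds on the "twisting" between the tangent spaces at
$M$ and at $M + \Delta$ with respect to the rank variety.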

B Curvature of Rank Variety 

For a symmetric rank-$r$ matrix $M$, the projection onto the tangent space $T(M)$ (restricted to the
variety of symmetric matrices with rank less than or equal to $r$) can be written in terms of the
projection $P_{U(M)}$ onto the row space $U(M)$. For any matrix $N$

$$\mathcal{P}_{T(M)}(N) = P_{U(M)} N + N P_{U(M)} - P_{U(M)} N P_{U(M)}.$$

One can then check that the projection onto the normal space $T(M)^\perp$ is

$$\mathcal{P}_{T(M)^\perp}(N) = [I - \mathcal{P}_{T(M)}](N) = (I - P_{U(M)})\, N\, (I - P_{U(M)}).$$
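
The two projection formulas above translate directly into code. The following NumPy sketch is an illustration only; the helper name `tangent_projections` and the random test matrices are assumptions made for the example.

```python
import numpy as np

def tangent_projections(M, r):
    """Return (P_T, P_T_perp) as functions, for a symmetric rank-r matrix M."""
    w, V = np.linalg.eigh(M)
    # Eigenvectors of the r largest-magnitude eigenvalues span U(M).
    U = V[:, np.argsort(-np.abs(w))[:r]]
    P_U = U @ U.T
    I = np.eye(M.shape[0])
    P_T = lambda N: P_U @ N + N @ P_U - P_U @ N @ P_U
    P_T_perp = lambda N: (I - P_U) @ N @ (I - P_U)
    return P_T, P_T_perp

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
M = B @ B.T                          # symmetric matrix of rank 2
N = rng.standard_normal((5, 5))
N = N + N.T                          # arbitrary symmetric test matrix
P_T, P_T_perp = tangent_projections(M, r=2)
```

One can confirm numerically that `P_T(N) + P_T_perp(N)` recovers `N`, that `P_T` is idempotent, and that `M` itself lies in its own tangent space.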
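
Proof of Proposition 2.1: For any matrix $N$, we have that

$$[\mathcal{P}_{T(M+\Delta)} - \mathcal{P}_{T(M)}](N) = [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] + [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}].$$

Further, we note that for $\kappa < \frac{\sigma}{2}$

$$\begin{aligned}
P_{U(M+\Delta)} - P_{U(M)} &= [I - P_{U(M)}] - [I - P_{U(M+\Delta)}] \\
&= P_{M,\kappa} - P_{M+\Delta,\kappa},
\end{aligned}$$

where $P_{M,\kappa}$ is defined in the previous section. Thus, we have the following sequence of inequalities
for $\kappa = \frac{\sigma}{4}$:

$$\begin{aligned}
\rho(T(M+\Delta), T(M)) &= \max_{\|N\|_2 \le 1} \big\| [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] + [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}] \big\|_2 \\
&\le \max_{\|N\|_2 \le 1} \big\| [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] \big\|_2 + \max_{\|N\|_2 \le 1} \big\| [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}] \big\|_2 \\
&\le 2\, \|P_{M+\Delta,\frac{\sigma}{4}} - P_{M,\frac{\sigma}{4}}\|_2 \\
&\le \frac{2}{\sigma}\,\|\Delta\|_2,
\end{aligned}$$

where we obtain the last inequality from Proposition A.1. □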

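Proof of Proposition 2.2: Since both $M$ and $M + \Delta$ are rank-$r$ matrices, we have that
$P^w_{M+\Delta,\kappa} = P^w_{M,\kappa} = 0$ for $\kappa = \frac{\sigma}{4}$. Consequently,

$$\|\mathcal{P}_{T(M)^\perp}(\Delta)\|_2 = \|(I - P_{U(M)})\,\Delta\,(I - P_{U(M)})\|_2 \le \frac{\|\Delta\|_2^2}{\sigma},$$

where we obtain the last inequality from Proposition A.2 with $\kappa = \frac{\sigma}{4}$. □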

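Proof of Lemma 3.2: Since $\rho(T_1, T_2) < 1$ one can check that the largest principal angle
between $T_1$ and $T_2$ is strictly less than $\frac{\pi}{2}$. Consequently, the mapping $\mathcal{P}_{T_2} : T_1 \to T_2$ restricted
to $T_1$ is bijective (as it is injective, and the spaces $T_1, T_2$ have the same dimension). Consider the
maximum and minimum gain of the operator $\mathcal{P}_{T_2}$ restricted to $T_1$; for any $M \in T_1$, $\|M\|_2 = 1$:

$$\|\mathcal{P}_{T_2}(M)\|_2 = \|M + [\mathcal{P}_{T_2} - \mathcal{P}_{T_1}](M)\|_2 \in [1 - \rho(T_1, T_2),\ 1 + \rho(T_1, T_2)].$$

Therefore, we can rewrite $\xi(T_2)$ as follows:

$$\begin{aligned}
\xi(T_2) &= \max_{N \in T_2,\ \|N\|_2 \le 1} \|N\|_\infty \\
&= \max_{N \in T_2,\ \|N\|_2 \le 1} \|\mathcal{P}_{T_2}(N)\|_\infty \\
&\le \max_{N \in T_1,\ \|N\|_2 \le \frac{1}{1 - \rho(T_1, T_2)}} \|\mathcal{P}_{T_2}(N)\|_\infty \\
&\le \max_{N \in T_1,\ \|N\|_2 \le \frac{1}{1 - \rho(T_1, T_2)}} \big[\|N\|_\infty + \|[\mathcal{P}_{T_1} - \mathcal{P}_{T_2}](N)\|_\infty\big] \\
&\le \frac{1}{1 - \rho(T_1, T_2)} \Big[\xi(T_1) + \max_{N \in T_1,\ \|N\|_2 \le 1} \|[\mathcal{P}_{T_1} - \mathcal{P}_{T_2}](N)\|_\infty\Big] \\
&\le \frac{1}{1 - \rho(T_1, T_2)} \Big[\xi(T_1) + \max_{\|N\|_2 \le 1} \|[\mathcal{P}_{T_1} - \mathcal{P}_{T_2}](N)\|_2\Big] \\
&\le \frac{1}{1 - \rho(T_1, T_2)} \big[\xi(T_1) + \rho(T_1, T_2)\big].
\end{aligned}$$

This concludes the proof of the lemma. □

C Transversality and Identifiability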

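Proof of Lemma 3.5: We have that $\mathcal{A}^\dagger\mathcal{A}(S, L) = (S + L, S + L)$; therefore,
$\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{A}\mathcal{P}_{\mathcal{Y}}(S, L) = (S + \mathcal{P}_\Omega(L), \mathcal{P}_T(S) + L)$. We need to bound
$\|S + \mathcal{P}_\Omega(L)\|_\infty$ and $\|\mathcal{P}_T(S) + L\|_2$. First, we have

$$\begin{aligned}
\|S + \mathcal{P}_\Omega(L)\|_\infty &\in [\|S\|_\infty - \|\mathcal{P}_\Omega(L)\|_\infty,\ \|S\|_\infty + \|\mathcal{P}_\Omega(L)\|_\infty] \\
&\subseteq [\|S\|_\infty - \|L\|_\infty,\ \|S\|_\infty + \|L\|_\infty] \\
&\subseteq [\gamma - \xi(T),\ \gamma + \xi(T)].
\end{aligned}$$

Similarly, one can check that

$$\begin{aligned}
\|\mathcal{P}_T(S) + L\|_2 &\in [-\|\mathcal{P}_T(S)\|_2 + \|L\|_2,\ \|\mathcal{P}_T(S)\|_2 + \|L\|_2] \\
&\subseteq [1 - 2\|S\|_2,\ 1 + 2\|S\|_2] \\
&\subseteq [1 - 2\gamma\mu(\Omega),\ 1 + 2\gamma\mu(\Omega)].
\end{aligned}$$

Thus, we can conclude that

$$g_\gamma\big(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{A}\mathcal{P}_{\mathcal{Y}}(S, L)\big) \in [1 - \chi(\Omega, T, \gamma),\ 1 + \chi(\Omega, T, \gamma)],$$

where $\chi(\Omega, T, \gamma)$ is defined in (3.5). □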

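Proof of Proposition 3.6: Before proving the two parts of this proposition we make a simple
observation about $\xi(T')$ using the condition that $\rho(T, T') \le \frac{\xi(T)}{2}$ by applying Lemma 3.2:

$$\xi(T') \le \frac{\xi(T) + \rho(T, T')}{1 - \rho(T, T')} \le \frac{\frac{3\xi(T)}{2}}{1 - \frac{\xi(T)}{2}} \le 3\xi(T).$$

Here we used the property that $\xi(T) \le 1$ in obtaining the final inequality. Consequently, noting
that $\gamma \in \left[\frac{3\beta(2-\nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2-\nu)\mu(\Omega)}\right]$ implies that

$$\chi(\Omega, T', \gamma) = \max\left\{\frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma\right\} \le \frac{\nu\alpha}{\beta(2-\nu)}. \tag{C.1}$$

Part 1: The proof of this step proceeds in a similar manner to that of Lemma 3.5. First we
have for $S \in \Omega$, $L \in T'$ with $\|S\|_\infty = \gamma$, $\|L\|_2 = 1$

$$\begin{aligned}
\|\mathcal{P}_\Omega\mathcal{I}^*(S + L)\|_\infty &\ge \|\mathcal{P}_\Omega\mathcal{I}^* S\|_\infty - \|\mathcal{P}_\Omega\mathcal{I}^* L\|_\infty \\
&\ge \alpha\gamma - \|\mathcal{I}^* L\|_\infty \\
&\ge \alpha\gamma - \beta\xi(T').
\end{aligned}$$

Next under the same conditions on $S, L$,

$$\begin{aligned}
\|\mathcal{P}_{T'}\mathcal{I}^*(S + L)\|_2 &\ge \|\mathcal{P}_{T'}\mathcal{I}^* L\|_2 - \|\mathcal{P}_{T'}\mathcal{I}^* S\|_2 \\
&\ge \alpha - 2\|\mathcal{I}^* S\|_2 \\
&\ge \alpha - 2\beta\mu(\Omega)\gamma.
\end{aligned}$$

Combining these last two bounds with (C.1), we conclude that

$$\begin{aligned}
\min_{(S,L) \in \mathcal{Y},\ \|S\|_\infty = \gamma,\ \|L\|_2 = 1} g_\gamma\big(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(S, L)\big)
&\ge \alpha - \beta\max\left\{\frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma\right\} \\
&\ge \alpha - \frac{\nu\alpha}{2-\nu} \\
&= \frac{2\alpha(1-\nu)}{2-\nu} \\
&\ge \frac{\alpha}{2},
\end{aligned}$$

where the final inequality follows from the assumption that $\nu \in (0, \frac{1}{2}]$.

Part 2: Note that for $S \in \Omega$, $L \in T'$ with $\|S\|_\infty \le \gamma$, $\|L\|_2 \le 1$

$$\begin{aligned}
\|\mathcal{P}_{\Omega^\perp}\mathcal{I}^*(S + L)\|_\infty &\le \|\mathcal{P}_{\Omega^\perp}\mathcal{I}^* S\|_\infty + \|\mathcal{P}_{\Omega^\perp}\mathcal{I}^* L\|_\infty \\
&\le \delta\gamma + \beta\xi(T').
\end{aligned}$$

Similarly

$$\begin{aligned}
\|\mathcal{P}_{T'^\perp}\mathcal{I}^*(S + L)\|_2 &\le \|\mathcal{P}_{T'^\perp}\mathcal{I}^* S\|_2 + \|\mathcal{P}_{T'^\perp}\mathcal{I}^* L\|_2 \\
&\le \beta\gamma\mu(\Omega) + \delta.
\end{aligned}$$

Combining these last two bounds with the bounds from the first part, we have that

$$\begin{aligned}
\big\|\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}\big(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}\big)^{-1}\big\|_{g_\gamma \to g_\gamma}
&\le \frac{\delta + \beta\max\left\{\frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma\right\}}{\alpha - \beta\max\left\{\frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma\right\}} \\
&\le \frac{\delta + \frac{\nu\alpha}{2-\nu}}{\alpha - \frac{\nu\alpha}{2-\nu}} \\
&\le \frac{(1-2\nu)\alpha + \frac{\nu\alpha}{2-\nu}}{\alpha - \frac{\nu\alpha}{2-\nu}} \\
&= 1 - \nu.
\end{aligned}$$

This concludes the proof of the proposition. □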



D Proof of main result 

Here we prove Theorem 4.1. Throughout this section we denote $m = \max\{1, \frac{1}{\gamma}\}$. Further
$\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*)$ denote the tangent spaces at the "true" sparse matrix
$S^* = K_O^*$ and low-rank matrix $L^* = K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$. We assume that

$$\gamma \in \left[\frac{3\beta(2-\nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2-\nu)\mu(\Omega)}\right]. \tag{D.1}$$

We also let $E_n = \Sigma_O^n - \Sigma_O^*$ denote the difference between the true marginal covariance and the
sample covariance. Finally we let $D = \max\{1, \frac{\nu\alpha}{3\beta(2-\nu)}\}$ throughout this section. For $\gamma$ in the above
range we note that

$$m \le \frac{D}{\xi(T)}. \tag{D.2}$$

Standard facts that we use throughout this section are that $\xi(T) \le 1$ and that $\|M\|_\infty \le \|M\|_2$ for
any matrix $M$.

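We study the following convex program:

$$\begin{aligned}
(\hat{S}_n, \hat{L}_n) = \arg\min_{S, L}\ & \mathrm{tr}[(S - L)\,\Sigma_O^n] - \log\det(S - L) + \lambda_n[\gamma\|S\|_1 + \|L\|_*] \\
\text{s.t.}\ & S - L \succ 0.
\end{aligned} \tag{D.3}$$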
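
To make the objective in (D.3) concrete, the following sketch evaluates it for given matrices. It is only an evaluator, not a solver, and it rests on the assumption that $\|S\|_1$ denotes the entrywise $\ell_1$ norm over all entries; the function name `objective` is chosen here for illustration.

```python
import numpy as np

def objective(S, L, Sigma_n, lam, gamma):
    """Value of tr[(S-L) Sigma_n] - logdet(S-L) + lam*(gamma*||S||_1 + ||L||_*);
    returns +inf when S - L is not positive definite."""
    K = S - L
    if np.linalg.eigvalsh(K).min() <= 0:
        return np.inf
    _, logdet = np.linalg.slogdet(K)
    l1 = np.abs(S).sum()                               # entrywise l1 norm (assumed)
    nuc = np.linalg.svd(L, compute_uv=False).sum()     # nuclear norm
    return np.trace(K @ Sigma_n) - logdet + lam * (gamma * l1 + nuc)

# Small sanity values: S = I, L = 0, Sigma_n = I in dimension 3.
val_no_penalty = objective(np.eye(3), np.zeros((3, 3)), np.eye(3), 0.0, 1.0)
val_penalized = objective(np.eye(3), np.zeros((3, 3)), np.eye(3), 1.0, 2.0)
val_infeasible = objective(np.eye(2), 2 * np.eye(2), np.eye(2), 1.0, 1.0)
```

Such an evaluator is mainly useful for comparing candidate solutions returned by different solvers or parameter choices.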

Comparing (D.3) with the convex program (4.1), the main difference is that we do not constrain
the variable $L$ to be positive semidefinite in (D.3) (recall that the nuclear norm of a positive
semidefinite matrix is equal to its trace). However we show that the unique optimum $(\hat{S}_n, \hat{L}_n)$ of
(D.3) under the hypotheses of Theorem 4.1 is such that $\hat{L}_n \succeq 0$ (with high probability). Therefore
we conclude that $(\hat{S}_n, \hat{L}_n)$ is also the unique optimum of (4.1). The subdifferential with respect to
the nuclear norm at a matrix $M$ with (reduced) SVD given by $M = UDV^T$ is as follows:

$$N \in \partial\|M\|_* \iff \mathcal{P}_{T(M)}(N) = UV^T, \quad \|\mathcal{P}_{T(M)^\perp}(N)\|_2 \le 1.$$

The proof of this theorem consists of a number of steps, each of which is analyzed in separate
sections below. We explicitly keep track of the constants $\alpha, \beta, \nu, \psi$. The key ideas are as follows:

1. We show that if we solve the convex program (D.3) subject to the additional constraints that
$S \in \Omega$ and $L \in T'$ for some $T'$ "close to" $T$ (measured by $\rho(T', T)$), then the error between
the optimal solution $(\hat{S}_n, \hat{L}_n)$ and the underlying matrices $(S^*, L^*)$ is small. This result is
discussed in Appendix D.2.

2. We analyze the optimization problem (D.3) with the additional constraint that the variables 
S and L belong to the algebraic varieties of sparse and low-rank matrices respectively, and 
that the corresponding tangent spaces are close to the tangent spaces at (S*,L*). We show 
that under suitable conditions on the minimum nonzero singular value of the true low-rank 
matrix L* and on the minimum magnitude nonzero entry of the true sparse matrix S* , the 
optimum of this modified program is achieved at a smooth point of the underlying varieties. In 
particular the bound on the minimum nonzero singular value of L* helps bound the curvature 
of the low-rank matrix variety locally around L* (we use the results described in Appendix B). 
These results are described in Appendix D.3. 

3. The next step is to show that the variety constraint can be linearized and changed to a 
tangent-space constraint (see Appendix D.4), thus giving us a convex program. Under suitable 
conditions this tangent-space constrained program also has an optimum that has the same 
support/rank as the true (S*,L*). Based on the previous step these tangent spaces in the 
constraints are close to the tangent spaces at the true (S*,L*). Therefore we use the first 
step to conclude that the resulting error in the estimate is small. 

4. Finally we show that under the identifiability conditions of Section 3 these tangent-space 
constraints are inactive at the optimum (see Appendix D.7). Therefore we conclude with the 
statement that the optimum of the convex program (D.3) without any variety constraints is 
achieved at a pair of matrices that have the same support/rank as the true (S*,L*) (with
high probability). Further the low-rank component of the solution is positive semidefinite, 
thus allowing us to conclude that the original convex program (4.1) also provides estimates 
that are consistent.

D.1 Bounded curvature of matrix inverse

Consider the Taylor series of the inverse of a matrix:

$$(M + \Delta)^{-1} = M^{-1} - M^{-1}\Delta M^{-1} + R_{M^{-1}}(\Delta),$$

where

$$R_{M^{-1}}(\Delta) = M^{-1}\left[\sum_{k=2}^{\infty} (-\Delta M^{-1})^k\right].$$

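This infinite sum converges for $\Delta$ sufficiently small. The following proposition provides a bound
on the second-order term specialized to our setting:

Proposition D.1. Suppose that $\gamma$ is in the range given by (D.1). Let $g_\gamma(\Delta_S, \Delta_L) \le \frac{1}{2C_1}$ for
$C_1 = \psi(1 + \frac{\alpha}{6\beta})$, and for any $(\Delta_S, \Delta_L)$ with $\Delta_S \in \Omega$. Then we have that

$$g_\gamma\big(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L))\big) \le \frac{2 D \psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2}{\xi(T)}.$$

Proof: We have that

$$\begin{aligned}
\|\mathcal{A}(\Delta_S, \Delta_L)\|_2 &\le \|\Delta_S\|_2 + \|\Delta_L\|_2 \\
&\le \gamma\mu(\Omega)\frac{\|\Delta_S\|_\infty}{\gamma} + \|\Delta_L\|_2 \\
&\le (1 + \gamma\mu(\Omega))\, g_\gamma(\Delta_S, \Delta_L) \\
&\le \Big(1 + \frac{\alpha}{6\beta}\Big)\, g_\gamma(\Delta_S, \Delta_L) \\
&\le \frac{1}{2\psi},
\end{aligned}$$

where the second-to-last inequality follows from the range for $\gamma$ (D.1) and that $\nu \in (0, \frac{1}{2}]$, and the
final inequality follows from the bound on $g_\gamma(\Delta_S, \Delta_L)$. Therefore,

$$\begin{aligned}
\|R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L))\|_2 &\le \psi\sum_{k=2}^{\infty} \big(\|\Delta_S + \Delta_L\|_2\,\psi\big)^k \\
&\le \psi^3\|\Delta_S + \Delta_L\|_2^2\, \frac{1}{1 - \|\Delta_S + \Delta_L\|_2\,\psi} \\
&\le 2\psi^3\Big(1 + \frac{\alpha}{6\beta}\Big)^2 g_\gamma(\Delta_S, \Delta_L)^2 \\
&= 2\psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2.
\end{aligned}$$

Here we apply the last two inequalities from above. Since the $\|\cdot\|_\infty$-norm is bounded above by the
spectral norm $\|\cdot\|_2$, and since $m \le \frac{D}{\xi(T)}$ by (D.2), we have the desired result. □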
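
The Taylor expansion of the matrix inverse used in Proposition D.1, together with the geometric-series bound on the remainder, can be checked numerically. The sketch below truncates the series after a fixed number of terms; the helper name `inverse_remainder`, the test matrices, and the truncation length are illustrative assumptions.

```python
import numpy as np

def inverse_remainder(M_inv, Delta, terms=60):
    """R_{M^{-1}}(Delta) = M^{-1} * sum_{k>=2} (-Delta M^{-1})^k, truncated."""
    X = -Delta @ M_inv
    power = X @ X                      # k = 2 term
    total = np.zeros_like(M_inv)
    for _ in range(terms):
        total += power
        power = power @ X
    return M_inv @ total

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T + 4 * np.eye(4)            # well-conditioned positive-definite matrix
M_inv = np.linalg.inv(M)
D = rng.standard_normal((4, 4))
D = 0.05 * (D + D.T)                   # small symmetric perturbation
R = inverse_remainder(M_inv, D)
lhs = np.linalg.inv(M + D)
rhs = M_inv - M_inv @ D @ M_inv + R
psi = np.linalg.norm(M_inv, 2)         # plays the role of psi in the bound
```

When $\|\Delta\|_2\,\psi \le \frac{1}{2}$, the spectral norm of `R` is at most $2\psi^3\|\Delta\|_2^2$, which is the second-order behavior exploited in the proof.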

D.2 Bounded errors 

Next we analyze the following convex program subject to certain additional tangent-space
constraints:

$$\begin{aligned}
(\hat{S}_\Omega, \hat{L}_{T'}) = \arg\min_{S, L}\ & \mathrm{tr}[(S - L)\,\Sigma_O^n] - \log\det(S - L) + \lambda_n[\gamma\|S\|_1 + \|L\|_*] \\
\text{s.t.}\ & S - L \succ 0,\quad S \in \Omega,\quad L \in T',
\end{aligned} \tag{D.4}$$

for some subspace $T'$. We show that if $T'$ is any tangent space to the low-rank matrix variety such
that $\rho(T, T') \le \frac{\xi(T)}{2}$, then we can bound the error $(\Delta_S, \Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T'})$. Let
$C_{T'} = \mathcal{P}_{T'^\perp}(L^*)$ denote the normal component of the true low-rank matrix at $T'$, and recall that
$E_n = \Sigma_O^n - \Sigma_O^*$ denotes the difference between the true marginal covariance and the sample covariance.
The proof of the following result uses Brouwer's fixed-point theorem [28], and is inspired by the
proof of a similar result in [29] for standard sparse graphical model recovery without latent variables.

Proposition D.2. Let the error $(\Delta_S, \Delta_L)$ in the solution of the convex program (D.4) (with $T'$
such that $\rho(T', T) \le \frac{\xi(T)}{2}$) be as defined above. Further let $C_1 = \psi(1 + \frac{\alpha}{6\beta})$, and define

$$r = \max\left\{\frac{8}{\alpha}\Big[g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^* C_{T'}) + \lambda_n\Big],\ \|C_{T'}\|_2\right\}.$$

If we have that

$$r \le \min\left\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64 D \psi C_1^2}\right\}$$

for $\gamma$ in the range given by (D.1), then

$$g_\gamma(\Delta_S, \Delta_L) \le 2r.$$

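Proof: Based on Proposition 3.6 we note that the convex program (D.4) is strictly convex
(because the negative log-likelihood term has a strictly positive-definite Hessian due to the
constraints involving transverse tangent spaces), and therefore the optimum is unique. Applying the
optimality conditions of the convex program (D.4) at the optimum $(\hat{S}_\Omega, \hat{L}_{T'})$, we have that there
exist Lagrange multipliers $Q_{\Omega^\perp} \in \Omega^\perp$, $Q_{T'^\perp} \in T'^\perp$ such that

$$\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} + Q_{\Omega^\perp} \in -\lambda_n\gamma\,\partial\|\hat{S}_\Omega\|_1, \qquad \Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} + Q_{T'^\perp} \in \lambda_n\,\partial\|\hat{L}_{T'}\|_*.$$

Restricting these conditions to the space $\mathcal{Y} = \Omega \times T'$, one can check that

$$\mathcal{P}_\Omega[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z_\Omega, \qquad \mathcal{P}_{T'}[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z_{T'},$$

where $Z_\Omega \in \Omega$, $Z_{T'} \in T'$ and $\|Z_\Omega\|_\infty = \lambda_n\gamma$, $\|Z_{T'}\|_2 \le 2\lambda_n$ (we use here the fact that projecting onto
a tangent space $T'$ increases the spectral norm by at most a factor of two). Denoting $Z = [Z_\Omega, Z_{T'}]$,
we conclude that

$$\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z, \tag{D.5}$$

with $g_\gamma(Z) \le 2\lambda_n$. Since the optimum $(\hat{S}_\Omega, \hat{L}_{T'})$ is unique, one can check using Lagrangian duality
theory [31] that $(\hat{S}_\Omega, \hat{L}_{T'})$ is the unique solution of the equation (D.5). Rewriting $\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}$
in terms of the errors $(\Delta_S, \Delta_L)$, we have using the Taylor series of the matrix inverse that

$$\begin{aligned}
\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} &= \Sigma_O^n - [\mathcal{A}(\Delta_S, \Delta_L) + (\Sigma_O^*)^{-1}]^{-1} \\
&= E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) + \mathcal{I}^*\mathcal{A}(\Delta_S, \Delta_L) \\
&= E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) + \mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S, \Delta_L) + \mathcal{I}^* C_{T'}.
\end{aligned} \tag{D.6}$$

Since $T'$ is a tangent space such that $\rho(T', T) \le \frac{\xi(T)}{2}$, we have from Proposition 3.6 that the
operator $\mathcal{B} = (\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}})^{-1}$ from $\mathcal{Y}$ to $\mathcal{Y}$ is bijective and is well-defined. Now consider the
following matrix-valued function from $(\delta_S, \delta_L) \in \mathcal{Y}$ to $\mathcal{Y}$:

$$F(\delta_S, \delta_L) = (\delta_S, \delta_L) - \mathcal{B}\Big\{\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\big[E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\delta_S, \delta_L) + \mathcal{I}^* C_{T'}\big] - Z\Big\}.$$

A point $(\delta_S, \delta_L) \in \mathcal{Y}$ is a fixed-point of $F$ if and only if
$\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger[E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\delta_S, \delta_L) + \mathcal{I}^* C_{T'}] = Z$.
Applying equations (D.5) and (D.6) above, we then see that the
only fixed-point of $F$ by construction is the "true" error $\mathcal{P}_{\mathcal{Y}}(\Delta_S, \Delta_L)$ restricted to $\mathcal{Y}$. The reason
for this is that, as discussed above, $(\hat{S}_\Omega, \hat{L}_{T'})$ is the unique optimum of (D.4) and therefore is
the unique solution of (D.5). Next we show that this unique fixed-point of $F$ lies in the ball
$\mathbb{B}_r = \{(\delta_S, \delta_L) \mid g_\gamma(\delta_S, \delta_L) \le r,\ (\delta_S, \delta_L) \in \mathcal{Y}\}$.

In order to prove this step, we resort to Brouwer's fixed point theorem [28]. In particular we
show that the function $F$ maps the ball $\mathbb{B}_r$ into itself. Since $F$ is a continuous function and $\mathbb{B}_r$ is
a compact convex set, we can conclude the proof of this proposition. Simplifying the function $F$, we have
that

$$F(\delta_S, \delta_L) = \mathcal{B}\Big\{\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) - \mathcal{I}^* C_{T'}\big] + Z\Big\}.$$

Consequently, we have from Proposition 3.6 that

$$\begin{aligned}
g_\gamma(F(\delta_S, \delta_L)) &\le \frac{2}{\alpha}\, g_\gamma\Big(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\big[E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathcal{I}^* C_{T'}\big] - Z\Big) \\
&\le \frac{4}{\alpha}\Big\{g_\gamma\Big(\mathcal{A}^\dagger\big[E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathcal{I}^* C_{T'}\big]\Big) + \lambda_n\Big\} \\
&\le \frac{r}{2} + \frac{4}{\alpha}\, g_\gamma\Big(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'}))\Big),
\end{aligned}$$

where in the second inequality we use the fact that $g_\gamma(\mathcal{P}_{\mathcal{Y}}(\cdot, \cdot)) \le 2 g_\gamma(\cdot, \cdot)$ and that $g_\gamma(Z) \le 2\lambda_n$,
and in the final inequality we use the assumption on $r$.

We now bound the term $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})))$ using Proposition D.1, which applies since
$g_\gamma(\delta_S, \delta_L + C_{T'}) \le 2r \le \frac{1}{2C_1}$:

$$\begin{aligned}
\frac{4}{\alpha}\, g_\gamma\Big(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'}))\Big) &\le \frac{32 D \psi C_1^2 r^2}{\xi(T)\alpha} \\
&\le \frac{32 D \psi C_1^2 r}{\xi(T)\alpha} \cdot \frac{\alpha\xi(T)}{64 D \psi C_1^2} \\
&\le \frac{r}{2},
\end{aligned}$$

where we have used the fact that $r \le \frac{\alpha\xi(T)}{64 D \psi C_1^2}$. Hence $g_\gamma(\mathcal{P}_{\mathcal{Y}}(\Delta_S, \Delta_L)) \le r$ by Brouwer's fixed-point
theorem. Finally we observe that

$$\begin{aligned}
g_\gamma(\Delta_S, \Delta_L) &\le g_\gamma(\mathcal{P}_{\mathcal{Y}}(\Delta_S, \Delta_L)) + \|C_{T'}\|_2 \\
&\le 2r.
\end{aligned}$$

□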



D.3 Solving a variety-constrained problem 

In order to prove that the solution $(\hat{S}_n, \hat{L}_n)$ of (D.3) has the same sparsity pattern/rank as $(S^*, L^*)$, we will study an optimization problem that explicitly enforces these constraints. Specifically, we consider the following non-convex constraint set:

\[
\mathcal{M} = \Big\{(S,L) \;\Big|\; S \in \Omega(S^*),\ \mathrm{rank}(L) \le \mathrm{rank}(L^*),\ \|\mathcal{P}_{T^\perp}(L - L^*)\|_2 \le \frac{\xi(T)\lambda_n}{D\psi^2},\ g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n \Big\}.
\]

Recall that $S^* = K_O^*$ and $L^* = K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$. The first constraint ensures that the tangent space at $S$ is the same as the tangent space at $S^*$; therefore the support of $S$ is contained in the support of $S^*$. The second and third constraints ensure that $L$ lives in the appropriate low-rank variety, but has a tangent space "close" to the tangent space $T$. The final constraint roughly bounds the sum of the errors $(S - S^*) + (L^* - L)$; note that this does not necessarily bound the individual errors. Notice that the only non-convex constraint is that $\mathrm{rank}(L) \le \mathrm{rank}(L^*)$. We then have the following nonlinear program:



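\[
\begin{aligned}
(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}}) = \arg\min_{S,L}\ & \mathrm{tr}[(S - L)\,\Sigma_O^n] - \log\det(S - L) + \lambda_n\big[\gamma\|S\|_1 + \|L\|_*\big] \\
\text{s.t.}\ & S - L \succ 0,\quad (S,L) \in \mathcal{M}.
\end{aligned}
\tag{D.7}
\]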



Under suitable conditions this nonlinear program is shown to have a unique solution. Each of the constraints in $\mathcal{M}$ is useful for proving the consistency of the solution of the convex program (D.3). We show that under suitable conditions the constraints in $\mathcal{M}$ are actually inactive at the optimal $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$, thus allowing us to conclude that the solution of (D.3) is also equal to $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$; hence the solution of (D.3) shares the consistency properties of $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$. A number of interesting properties can be derived simply by studying the constraint set $\mathcal{M}$.
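As a concrete reading of the objective shared by (D.3) and (D.7), the following minimal sketch evaluates $\mathrm{tr}[(S-L)\Sigma_O^n] - \log\det(S-L) + \lambda_n[\gamma\|S\|_1 + \|L\|_*]$ for given symmetric matrices. It only evaluates the objective and does not solve the program or enforce membership in $\mathcal{M}$; the function name and argument names are hypothetical.

```python
import numpy as np

def penalized_objective(S, L, sigma_n, lam, gamma):
    """Regularized negative log-likelihood at (S, L); +inf outside S - L > 0."""
    K = S - L
    eig = np.linalg.eigvalsh(K)            # K is assumed symmetric
    if eig.min() <= 0:
        return np.inf                      # violates the constraint S - L > 0
    log_det = np.log(eig).sum()
    l1 = np.abs(S).sum()                   # entrywise l1 norm of S
    nuc = np.linalg.norm(L, ord="nuc")     # nuclear norm of L
    return np.trace(K @ sigma_n) - log_det + lam * (gamma * l1 + nuc)
```

For instance, with $S = I_3$, $L = 0$, $\Sigma_O^n = I_3$, $\lambda_n = 0.5$ and $\gamma = 2$ the value is $3 - 0 + 0.5 \cdot (2 \cdot 3) = 6$.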

Proposition D.3. Consider any $(S,L) \in \mathcal{M}$, and let $\Delta_S = S - S^*$, $\Delta_L = L^* - L$. For $\gamma$ in the range specified by (D.1) and letting $C_2 = \frac{48}{\alpha} + \frac{1}{\psi^2}$, we have that $g_\gamma(\Delta_S,\Delta_L) \le C_2\lambda_n$.

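Proof: We have by the triangle inequality that

\[
\begin{aligned}
g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(\mathcal{P}_\Omega(\Delta_S), \mathcal{P}_T(\Delta_L))) &\le 11\lambda_n + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(\mathcal{P}_{\Omega^\perp}(\Delta_S), \mathcal{P}_{T^\perp}(\Delta_L))) \\
&\le 11\lambda_n + m\psi^2\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \\
&\le 12\lambda_n,
\end{aligned}
\]

as $m \le \frac{D}{\xi(T)}$. Therefore, we have that $g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) \le 24\lambda_n$, where $\mathcal{Y} = \Omega \times T$. Consequently, we can apply Proposition 3.6 to conclude that

\[
g_\gamma(\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) \le \frac{48\lambda_n}{\alpha}.
\]

Finally, we use the triangle inequality again to conclude that

\[
\begin{aligned}
g_\gamma(\Delta_S,\Delta_L) &\le g_\gamma(\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) + g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}(\Delta_S,\Delta_L)) \\
&\le \frac{48\lambda_n}{\alpha} + m\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \\
&\le C_2\lambda_n.
\end{aligned}
\]

□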

This simple result immediately leads to a number of useful corollaries. For example we have that under a suitable bound on the minimum nonzero singular value of $L^* = K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$, the constraint in $\mathcal{M}$ along the normal direction $T^\perp$ is locally inactive. Next we list several useful consequences of Proposition D.3.

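Corollary D.4. Consider any $(S,L) \in \mathcal{M}$, and let $\Delta_S = S - S^*$, $\Delta_L = L^* - L$. Suppose $\gamma$ is in the range specified by (D.1), and let $C_3 = \left(\frac{6(2-\nu)}{\nu} + 1\right)C_2^2\psi^2 D$ and $C_4 = C_2 + \frac{3\alpha C_2^2(2-\nu)}{16(3-\nu)}$ (where $C_2$ is as defined in Proposition D.3). Let the minimum nonzero singular value $\sigma$ of $L^* = K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$ be such that $\sigma \ge \frac{C_5\lambda_n}{\xi(T)^2}$ for $C_5 = \max\{C_3, C_4\}$, and suppose that the smallest magnitude nonzero entry of $S^*$ is greater than $\frac{C_6\lambda_n}{\mu(\Omega)}$ for $C_6 = \frac{C_2\nu\alpha}{\beta(2-\nu)}$. Setting $T' = T(L)$ and $C_{T'} = \mathcal{P}_{T'^\perp}(L^*)$, we then have that:

1. $L$ has rank equal to $\mathrm{rank}(L^*)$, i.e., $L$ is a smooth point of the variety of matrices with rank less than or equal to $\mathrm{rank}(L^*)$. In particular $L$ has the same inertia as $L^*$.

2. $\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \le \frac{\xi(T)\lambda_n}{19 D\psi^2}$.

3. $\rho(T, T') \le \frac{\xi(T)}{4}$.

4. $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T'}) \le \frac{\lambda_n\nu}{6(2-\nu)}$.

5. $\|C_{T'}\|_2 \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}$.

6. $\mathrm{sign}(S) = \mathrm{sign}(S^*)$.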

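Proof: We note the following facts before proving each step. First $C_2 \ge \frac{1}{\psi^2} \ge \frac{1}{m\psi^2} \ge \frac{\xi(T)}{D\psi^2}$. Second $\xi(T) \le 1$. Third we have from Proposition D.3 that $\|\Delta_L\|_2 \le C_2\lambda_n$. Finally $\frac{6(2-\nu)}{\nu} \ge 18$ for $\nu \in (0, \frac{1}{2}]$. We prove each step separately.

For the first step, we note that

\[
\sigma \ge \frac{C_3\lambda_n}{\xi(T)^2} \ge \frac{19 C_2^2\psi^2 D\lambda_n}{\xi(T)^2} \ge 19 C_2\lambda_n \ge 8\|\Delta_L\|_2.
\]

Hence $L$ is a smooth point with rank equal to $\mathrm{rank}(L^*)$, and specifically has the same inertia as $L^*$.

For the second step, we use the fact that $\sigma \ge 8\|\Delta_L\|_2$ to apply Proposition 2.2:

\[
\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \le \frac{\|\Delta_L\|_2^2}{\sigma} \le \frac{C_2^2\xi(T)^2\lambda_n}{C_3} \le \frac{\xi(T)\lambda_n}{19 D\psi^2}.
\]

For the third step we apply Proposition 2.1 (by using the conclusion from above that $\sigma \ge 8\|\Delta_L\|_2$) so that

\[
\rho(T,T') \le \frac{2\|\Delta_L\|_2}{\sigma} \le \frac{2 C_2\xi(T)^2}{C_3} \le \frac{2\xi(T)^2}{19} \le \frac{\xi(T)}{4}.
\]

For the fourth step let $\sigma'$ denote the minimum nonzero singular value of $L$. Consequently,

\[
\sigma' \ge \frac{C_3\lambda_n}{\xi(T)^2} - C_2\lambda_n \ge C_2\lambda_n\left[\frac{19 C_2 D\psi^2}{\xi(T)^2} - 1\right] \ge 8\|\Delta_L\|_2.
\]

Using the same reasoning as in the proof of the second step, we have that

\[
\|C_{T'}\|_2 \le \frac{\|\Delta_L\|_2^2}{\sigma'} \le \frac{C_2^2\lambda_n^2}{\sigma'} \le \frac{\nu\,\xi(T)^2\lambda_n}{6(2-\nu)D\psi^2}.
\]

Hence

\[
g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T'}) \le m\psi^2\|C_{T'}\|_2 \le \frac{\lambda_n\nu}{6(2-\nu)}.
\]

For the fifth step the bound on $\sigma'$ implies that

\[
\sigma' \ge \frac{C_4\lambda_n}{\xi(T)^2} - C_2\lambda_n \ge \frac{3 C_2^2\alpha(2-\nu)}{16(3-\nu)}\lambda_n.
\]

Since $\sigma' \ge 8\|\Delta_L\|_2$, we have from Proposition 2.2 and some algebra that

\[
\|C_{T'}\|_2 \le \frac{C_2^2\lambda_n^2}{\sigma'} \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}.
\]

For the final step since $\|\Delta_S\|_\infty \le \gamma C_2\lambda_n$, the assumed lower bound on the minimum magnitude nonzero entry of $S^*$ guarantees that $\mathrm{sign}(S) = \mathrm{sign}(S^*)$. □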
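The first conclusion of the corollary rests on a standard perturbation fact: a matrix of rank at most $r$ that lies within spectral distance $\sigma/8$ of a rank-$r$ matrix with smallest nonzero singular value $\sigma$ must itself have rank exactly $r$, and in the symmetric case the same inertia. The sketch below illustrates this on a random positive semidefinite example; the construction (perturbing a factor so that the rank constraint holds automatically) and all names are assumptions made for illustration, not part of the proof.

```python
import numpy as np

def inertia(M, tol=1e-8):
    """Return (number of positive, number of negative) eigenvalues of symmetric M."""
    eig = np.linalg.eigvalsh(M)
    return int((eig > tol).sum()), int((eig < -tol).sum())

def perturb_within_variety(p=8, r=2, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((p, r))
    L_star = U @ U.T                                  # rank-r PSD matrix
    sigma = np.linalg.eigvalsh(L_star)[-r]            # smallest nonzero eigenvalue
    E = rng.standard_normal((p, r))
    t = 1.0
    while True:
        L = (U + t * E) @ (U + t * E).T               # rank(L) <= r by construction
        if np.linalg.norm(L - L_star, 2) <= sigma / 8:
            return L_star, L, sigma
        t /= 2                                        # shrink until within sigma / 8
```

By Weyl's inequality the $r$-th largest eigenvalue of the perturbed matrix is at least $7\sigma/8$, which is what keeps the rank and the inertia unchanged.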

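Notice that this corollary applies to any $(S,L) \in \mathcal{M}$, and is hence applicable to any solution $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ of the $\mathcal{M}$-constrained program (D.7). For now we choose an arbitrary solution $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ and proceed. In the next steps we show that $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ is the unique solution to the convex program (D.3), thus showing that $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ is also the unique solution to (D.7).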

D.4 From variety constraint to tangent-space constraint 

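Given the solution $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$, we show that the solution to the convex program (D.4) with the tangent space constraint $L \in T_{\mathcal{M}} = T(\hat{L}_{\mathcal{M}})$ is the same as $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ under suitable conditions:

\[
\begin{aligned}
(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}}) = \arg\min_{S,L}\ & \mathrm{tr}[(S - L)\,\Sigma_O^n] - \log\det(S - L) + \lambda_n\big[\gamma\|S\|_1 + \|L\|_*\big] \\
\text{s.t.}\ & S - L \succ 0,\quad S \in \Omega,\quad L \in T_{\mathcal{M}}.
\end{aligned}
\tag{D.8}
\]

Assuming the bound of Corollary D.4 on the minimum singular value of $L^*$, the uniqueness of the solution $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ is assured. This is because we have from Proposition 3.6 and from Corollary D.4 that $\mathcal{I}^*$ is injective on $\Omega \oplus T_{\mathcal{M}}$. Therefore the Hessian of the convex objective function of (D.8) is strictly positive-definite at $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$.

We let $C_{T_{\mathcal{M}}} = \mathcal{P}_{T_{\mathcal{M}}^\perp}(L^*)$. Recall that $E_n = \Sigma_O^n - \Sigma_O^*$ denotes the difference between the sample covariance matrix and the marginal covariance matrix of the observed variables.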

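Proposition D.5. Let $\gamma$ be in the range specified by (D.1). Suppose that the minimum nonzero singular value $\sigma$ of $L^* = K_{O,H}^*(K_H^*)^{-1}K_{H,O}^*$ is such that $\sigma \ge \frac{C_5\lambda_n}{\xi(T)^2}$ ($C_5$ is defined in Corollary D.4). Suppose also that the minimum magnitude nonzero entry of $S^*$ is greater than or equal to $\frac{C_6\lambda_n}{\mu(\Omega)}$ ($C_6$ is defined in Corollary D.4). Let $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n\nu}{6(2-\nu)}$. Further suppose that

\[
\lambda_n \le \frac{3\alpha(2-\nu)}{16(3-\nu)}\min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}.
\]

Then we have that

\[
(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}}) = (\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}}).
\]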

Proof: Note first that the condition on the minimum singular value of $L^*$ in Corollary D.4 is satisfied. Therefore we proceed with the following two steps:

1. First we can change the non-convex constraint $\mathrm{rank}(L) \le \mathrm{rank}(L^*)$ to the linear constraint $L \in T(\hat{L}_{\mathcal{M}})$. This is because the lower bound assumed for $\sigma$ implies that $\hat{L}_{\mathcal{M}}$ is a smooth point of the algebraic variety of matrices with rank less than or equal to $\mathrm{rank}(L^*)$ (from Corollary D.4). Due to the convexity of all the other constraints and the objective, the optimum of this "linearized" convex program will still be $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$.

2. Next we can again apply Corollary D.4 (based on the bound on $\sigma$) to conclude that the constraint $\|\mathcal{P}_{T^\perp}(L - L^*)\|_2 \le \frac{\xi(T)\lambda_n}{D\psi^2}$ is locally inactive at the point $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$.

Consequently, we have that $(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}})$ can be written as the solution of a convex program:

\[
\begin{aligned}
(\hat{S}_{\mathcal{M}}, \hat{L}_{\mathcal{M}}) = \arg\min_{S,L}\ & \mathrm{tr}[(S - L)\,\Sigma_O^n] - \log\det(S - L) + \lambda_n\big[\gamma\|S\|_1 + \|L\|_*\big] \\
\text{s.t.}\ & S - L \succ 0,\quad S \in \Omega,\quad L \in T_{\mathcal{M}}, \\
& g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n.
\end{aligned}
\tag{D.9}
\]



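We now need to argue that the constraint $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n$ is also inactive in the convex program (D.9). We proceed by showing that the solution $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ of the convex program (D.8) has the property that $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_{\mathcal{M}}})) < 11\lambda_n$, which concludes the proof of this proposition. We have from Corollary D.4 that $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) \le \frac{\lambda_n\nu}{6(2-\nu)}$. Since $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n\nu}{6(2-\nu)}$ by assumption, one can verify that

\[
\begin{aligned}
\frac{8}{\alpha}\Big[\lambda_n + g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}})\Big] &\le \frac{8\lambda_n}{\alpha}\left[1 + \frac{\nu}{3(2-\nu)}\right] \\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \\
&\le \min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}.
\end{aligned}
\]

The last line follows from the assumption on $\lambda_n$. We also note that $\|C_{T_{\mathcal{M}}}\|_2 \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}$ from Corollary D.4, which implies that $\|C_{T_{\mathcal{M}}}\|_2 \le \min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}$. Letting $(\Delta_S,\Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_{\mathcal{M}}})$, we can conclude from Proposition D.2 that $g_\gamma(\Delta_S,\Delta_L) \le \frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)}$. Next we apply Proposition D.1 (as $g_\gamma(\Delta_S,\Delta_L) \le \frac{1}{2C_1}$) to conclude that

\[
\begin{aligned}
g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) &\le \frac{2 D\psi C_1^2}{\xi(T)}\, g_\gamma(\Delta_S,\Delta_L)^2 \\
&\le \frac{2 D\psi C_1^2}{\xi(T)}\cdot\frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)}\cdot\frac{\alpha\xi(T)}{32 D\psi C_1^2} \\
&= \frac{2(3-\nu)\lambda_n}{3(2-\nu)}.
\end{aligned}
\tag{D.10}
\]

From the optimality conditions of (D.8) one can also check that for $\mathcal{Y} = \Omega \times T_{\mathcal{M}}$,

\[
\begin{aligned}
g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) &\le 2\lambda_n + g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) + g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) + g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger E_n) \\
&\le 2\Big[\lambda_n + g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) + g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}})\Big] \\
&\le 4\left[\frac{2(3-\nu)\lambda_n}{3(2-\nu)}\right].
\end{aligned}
\]

Here we used (D.10) in the last inequality, and also that $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) \le \frac{\lambda_n\nu}{6(2-\nu)}$ (as noted above from Corollary D.4) and that $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n\nu}{6(2-\nu)}$. Therefore,

\[
g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) \le \frac{16\lambda_n}{3},
\tag{D.11}
\]

because $\nu \in (0, \frac{1}{2}]$. Based on Proposition 3.6 (the second part), we also have that

\[
g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) \le (1-\nu)\frac{16\lambda_n}{3} \le \frac{16\lambda_n}{3}.
\tag{D.12}
\]

Summarizing steps (D.11) and (D.12),

\[
\begin{aligned}
g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}(\Delta_S,\Delta_L)) &\le g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) + g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) \\
&\le \frac{16\lambda_n}{3} + \frac{16\lambda_n}{3} + \frac{\lambda_n\nu}{6(2-\nu)} \\
&\le \frac{32\lambda_n}{3} + \frac{\lambda_n}{18} \\
&< 11\lambda_n.
\end{aligned}
\]

This concludes the proof of the proposition. □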

This proposition has the following important consequence. 

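Corollary D.6. Under the assumptions of Proposition D.5 we have that $\mathrm{rank}(\hat{L}_{T_{\mathcal{M}}}) = \mathrm{rank}(L^*)$ and that $T(\hat{L}_{T_{\mathcal{M}}}) = T_{\mathcal{M}}$. Moreover, $\hat{L}_{T_{\mathcal{M}}}$ actually has the same inertia as $L^*$. We also have that $\mathrm{sign}(\hat{S}_\Omega) = \mathrm{sign}(S^*)$.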

D.5 Removing the tangent-space constraints 

The following lemma provides a simple set of sufficient conditions under which the optimal solution $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ of (D.8) satisfies the optimality conditions of the convex program (D.3) (without the tangent space constraints).

Lemma D.7. Let $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ be the solution to the tangent-space constrained convex program (D.8). Suppose that the assumptions of Proposition D.5 hold. If in addition we have that

\[
g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L))) \le \frac{\lambda_n\nu}{6(2-\nu)},
\]

then $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ is also the unique optimum of the convex program (D.3).

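Proof: Recall from Corollary D.6 that the tangent space at $\hat{L}_{T_{\mathcal{M}}}$ is equal to $T_{\mathcal{M}}$. Applying the optimality conditions of the convex program (D.8) at the optimum $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$, we have that there exist Lagrange multipliers $Q_{\Omega^\perp} \in \Omega^\perp$, $Q_{T^\perp} \in T_{\mathcal{M}}^\perp$ such that

\[
\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1} + Q_{\Omega^\perp} \in -\lambda_n\gamma\,\partial\|\hat{S}_\Omega\|_1, \qquad
\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1} + Q_{T^\perp} \in \lambda_n\,\partial\|\hat{L}_{T_{\mathcal{M}}}\|_*.
\]

Restricting these conditions to the space $\mathcal{Y} = \Omega \times T_{\mathcal{M}}$, one can check that

\[
\mathcal{P}_\Omega\big[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1}\big] = -\lambda_n\gamma\,\mathrm{sign}(S^*), \qquad
\mathcal{P}_{T_{\mathcal{M}}}\big[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1}\big] = \lambda_n UV^T,
\]

where $\hat{L}_{T_{\mathcal{M}}} = UDV^T$ is a reduced SVD of $\hat{L}_{T_{\mathcal{M}}}$. Denoting $Z = [-\lambda_n\gamma\,\mathrm{sign}(S^*), \lambda_n UV^T]$, we conclude that

\[
\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\big[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1}\big] = Z,
\tag{D.13}
\]

with $g_\gamma(Z) = \lambda_n$. It is clear that the optimality condition of the convex program (D.3) (without the tangent-space constraints) on $\mathcal{Y}$ is satisfied. All we need to show is that

\[
g_\gamma\Big(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\big[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1}\big]\Big) < \lambda_n.
\tag{D.14}
\]

Rewriting $\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1}$ in terms of the error $(\Delta_S,\Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_{\mathcal{M}}})$, we have that

\[
\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_{\mathcal{M}}})^{-1} = E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) + \mathcal{I}^*\mathcal{A}(\Delta_S,\Delta_L).
\]

Restating the condition (D.13) on $\mathcal{Y}$, we have that

\[
\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L) = Z + \mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) - \mathcal{I}^*C_{T_{\mathcal{M}}}\big].
\tag{D.15}
\]

(Recall that $C_{T_{\mathcal{M}}} = \mathcal{P}_{T_{\mathcal{M}}^\perp}(L^*)$.) A sufficient condition to show (D.14) and complete the proof of this lemma is that

\[
g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) < \lambda_n - g_\gamma\Big(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) - \mathcal{I}^*C_{T_{\mathcal{M}}}\big]\Big).
\]

We prove this inequality next. Recall from Corollary D.4 that $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) \le \frac{\lambda_n\nu}{6(2-\nu)}$. Therefore, from equation (D.15) we can conclude that

\[
\begin{aligned}
g_\gamma(\mathcal{P}_{\mathcal{Y}}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) &\le \lambda_n + 2\, g_\gamma\Big(\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) - \mathcal{I}^*C_{T_{\mathcal{M}}}\big]\Big) \\
&\le \lambda_n + 2\left[\frac{3\lambda_n\nu}{6(2-\nu)}\right] \\
&= \frac{2\lambda_n}{2-\nu}.
\end{aligned}
\]

Here we used the bounds assumed on $g_\gamma(\mathcal{A}^\dagger E_n)$ and on $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)))$. Applying the second part of Proposition 3.6, we have that

\[
\begin{aligned}
g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\mathcal{I}^*\mathcal{A}\mathcal{P}_{\mathcal{Y}}(\Delta_S,\Delta_L)) &\le \frac{2\lambda_n(1-\nu)}{2-\nu} \\
&= \lambda_n - \frac{\nu\lambda_n}{2-\nu} \\
&< \lambda_n - \frac{\nu\lambda_n}{2(2-\nu)} \\
&\le \lambda_n - g_\gamma\Big(\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) - \mathcal{I}^*C_{T_{\mathcal{M}}}\big]\Big) \\
&\le \lambda_n - g_\gamma\Big(\mathcal{P}_{\mathcal{Y}^\perp}\mathcal{A}^\dagger\big[-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)) - \mathcal{I}^*C_{T_{\mathcal{M}}}\big]\Big).
\end{aligned}
\]

Here the second-to-last inequality follows from the bounds on $g_\gamma(\mathcal{A}^\dagger E_n)$, $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)))$, and $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}})$, and the last inequality follows from Lemma 3.4. This concludes the proof of the lemma. □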
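The proof of Lemma D.7 uses the structure of the nuclear-norm subdifferential: at a matrix with reduced SVD $UDV^T$, every subgradient has the form $UV^T + W$ with $\mathcal{P}_T(W) = 0$ and $\|W\|_2 \le 1$. The sketch below checks this standard characterization numerically on a random positive semidefinite matrix; the helper names and the test instance are assumptions made for illustration.

```python
import numpy as np

def reduced_svd(L, r):
    """Return the leading r singular triplets (U, s, V) of L."""
    U, s, Vt = np.linalg.svd(L)
    return U[:, :r], s[:r], Vt[:r, :].T

def project_tangent(M, U, V):
    """Projection onto the tangent space T at a matrix with row/column spaces U, V."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ M + M @ PV - PU @ M @ PV

def subgradient_check(p=7, r=2, seed=1):
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((p, r))
    L = B @ B.T                                   # rank-r PSD matrix
    U, s, V = reduced_svd(L, r)
    N = rng.standard_normal((p, p))
    W = N - project_tangent(N, U, V)              # component of N in T-perp
    W = W / (2 * np.linalg.norm(W, 2))            # rescale so that ||W||_2 = 1/2
    G = U @ V.T + W                               # candidate subgradient
    return {
        "tangent_part_error": np.linalg.norm(project_tangent(G, U, V) - U @ V.T),
        "spectral_norm": np.linalg.norm(G, 2),
        "inner_product": np.sum(G * L),
        "nuclear_norm": s.sum(),
    }
```

The tangent component of the subgradient is exactly $UV^T$, which is the identity used for the $L$-block of $Z$ in (D.13); the condition (D.14) then controls the component in $\mathcal{Y}^\perp$.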

D.6 Probabilistic analysis 

All the analysis described so far in this section has been completely deterministic in nature. Here 
we present the probabilistic component of our proof. Specifically, we study the rate at which the 
sample covariance matrix converges to the true covariance matrix. The following result from [10] 
plays a key role in our analysis: 

Theorem D.8. Given natural numbers $n, p$ with $p \le n$, let $\Gamma$ be a $p \times n$ matrix with i.i.d. Gaussian entries that have zero-mean and variance $\frac{1}{n}$. Then the largest and smallest singular values $s_1(\Gamma)$ and $s_p(\Gamma)$ of $\Gamma$ are such that

\[
\max\left\{\Pr\left[s_1(\Gamma) \ge 1 + \sqrt{\tfrac{p}{n}} + t\right],\ \Pr\left[s_p(\Gamma) \le 1 - \sqrt{\tfrac{p}{n}} - t\right]\right\} \le \exp\left\{-\frac{nt^2}{2}\right\},
\]

for any $t > 0$.

Using this result the next lemma provides a probabilistic bound between the sample covariance $\Sigma_O^n$ formed using $n$ samples and the true covariance $\Sigma_O^*$ in spectral norm. This result is well-known, and we mainly discuss it here for completeness and also to show explicitly the dependence on $\psi = \|\Sigma_O^*\|_2$ defined in (3.6).
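Before turning to the lemma, the concentration statement of Theorem D.8 can be illustrated with a short simulation. The sketch below draws one $p \times n$ Gaussian matrix with entry variance $1/n$ and returns its extreme singular values alongside the tail bound $\exp\{-nt^2/2\}$; the function names, dimensions, and seed are arbitrary choices for illustration.

```python
import numpy as np

def extreme_singular_values(p, n, seed=0):
    """Largest and smallest singular values of a p x n Gaussian matrix
    with i.i.d. zero-mean entries of variance 1/n."""
    rng = np.random.default_rng(seed)
    gamma_mat = rng.standard_normal((p, n)) / np.sqrt(n)
    s = np.linalg.svd(gamma_mat, compute_uv=False)   # sorted in decreasing order
    return s[0], s[-1]

def tail_bound(n, t):
    """Right-hand side exp(-n t^2 / 2) of the bound in Theorem D.8."""
    return np.exp(-n * t**2 / 2)
```

With $p = 20$, $n = 2000$ and $t = 0.1$, the theorem says that each of the events $s_1(\Gamma) \ge 1.2$ and $s_p(\Gamma) \le 0.8$ has probability at most $e^{-10}$.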



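Lemma D.9. Let $\psi = \|\Sigma_O^*\|_2$. Given any $\delta > 0$ with $\delta \le 8\psi$, let the number of samples $n$ be such that $n \ge \frac{64 p\psi^2}{\delta^2}$. Then we have that

\[
\Pr\left[\|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta\right] \le 2\exp\left\{-\frac{n\delta^2}{128\psi^2}\right\}.
\]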

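Proof: Since the spectral norm is unitarily invariant, we can assume that $\Sigma_O^*$ is diagonal without loss of generality. Let $\bar{\Sigma}^n = (\Sigma_O^*)^{-\frac{1}{2}}\,\Sigma_O^n\,(\Sigma_O^*)^{-\frac{1}{2}}$, and let $s_1(\bar{\Sigma}^n)$, $s_p(\bar{\Sigma}^n)$ denote the largest/smallest singular values of $\bar{\Sigma}^n$. Note that $\bar{\Sigma}^n$ can be viewed as the sample covariance matrix formed from $n$ independent samples drawn from a model with identity covariance, i.e., $\bar{\Sigma}^n = \Gamma\Gamma^T$ where $\Gamma$ denotes a $p \times n$ matrix with i.i.d. Gaussian entries that have zero-mean and variance $\frac{1}{n}$. We then have that

\[
\begin{aligned}
\Pr\left[\|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta\right] &\le \Pr\left[\|\bar{\Sigma}^n - I\|_2 \ge \tfrac{\delta}{\psi}\right] \\
&\le \Pr\left[s_1(\bar{\Sigma}^n) \ge 1 + \tfrac{\delta}{\psi}\right] + \Pr\left[s_p(\bar{\Sigma}^n) \le 1 - \tfrac{\delta}{\psi}\right] \\
&= \Pr\left[s_1(\Gamma)^2 \ge 1 + \tfrac{\delta}{\psi}\right] + \Pr\left[s_p(\Gamma)^2 \le 1 - \tfrac{\delta}{\psi}\right] \\
&\le \Pr\left[s_1(\Gamma) \ge 1 + \tfrac{\delta}{4\psi}\right] + \Pr\left[s_p(\Gamma) \le 1 - \tfrac{\delta}{4\psi}\right] \\
&\le \Pr\left[s_1(\Gamma) \ge 1 + \sqrt{\tfrac{p}{n}} + \tfrac{\delta}{8\psi}\right] + \Pr\left[s_p(\Gamma) \le 1 - \sqrt{\tfrac{p}{n}} - \tfrac{\delta}{8\psi}\right] \\
&\le 2\exp\left\{-\frac{n\delta^2}{128\psi^2}\right\}.
\end{aligned}
\]

Here we used the fact that $n \ge \frac{64 p\psi^2}{\delta^2}$ in the fourth inequality, and we applied Theorem D.8 to obtain the final inequality by setting $t = \frac{\delta}{8\psi}$. □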

The following corollary relates the number of samples to an error bound that holds with probability at least $1 - 2\exp\{-p\}$.

Corollary D.10. Let $\Sigma_O^n$ be the sample covariance formed from $n$ samples of the observed variables. Set $\delta_n = \sqrt{\frac{128 p\psi^2}{n}}$. If $n \ge 2p$, then we have that

\[
\Pr\left[\|\Sigma_O^n - \Sigma_O^*\|_2 \le \delta_n\right] \ge 1 - 2\exp\{-p\}.
\]

Proof: We note that $n \ge 2p$ implies that $\delta_n \le 8\psi$, and apply Lemma D.9. □
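A quick simulation makes the scale of Corollary D.10 concrete. The sketch below draws $n$ zero-mean Gaussian samples with a chosen covariance, forms the sample covariance, and compares the spectral-norm error with $\delta_n = \sqrt{128 p\psi^2/n}$. The covariance matrix, sample size, and function name are hypothetical choices; the zero-mean assumption is why no centering is applied.

```python
import numpy as np

def covariance_error(sigma_star, n, seed=0):
    """Spectral-norm error of the sample covariance and the threshold delta_n."""
    rng = np.random.default_rng(seed)
    p = sigma_star.shape[0]
    X = rng.multivariate_normal(np.zeros(p), sigma_star, size=n)
    sigma_n = X.T @ X / n                      # zero-mean model: no centering
    err = np.linalg.norm(sigma_n - sigma_star, 2)
    psi = np.linalg.norm(sigma_star, 2)        # psi = spectral norm of the truth
    delta_n = np.sqrt(128 * p * psi**2 / n)
    return err, delta_n

sigma_star = np.diag([2.0, 1.0, 0.5, 0.25])
err, delta_n = covariance_error(sigma_star, n=400)
```

In such experiments the observed error is typically far below $\delta_n$, which reflects that the constants in the bound are chosen for convenience in the proof and not for tightness.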



D.7 Putting it all together 

In this section we tie together the results obtained thus far to conclude the proof of Theorem 4.1. We only need to show that the sufficient conditions of Lemma D.7 are satisfied. It follows directly from Corollary D.6 that the low-rank part $\hat{L}_{T_{\mathcal{M}}}$ is positive semidefinite, which implies that $(\hat{S}_\Omega, \hat{L}_{T_{\mathcal{M}}})$ is also the solution to the original regularized maximum-likelihood convex program (4.1) with the positive-semidefinite constraint. As usual set $(\Delta_S,\Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_{\mathcal{M}}})$, and set $E_n = \Sigma_O^n - \Sigma_O^*$.

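Assumptions: We specify here the constants that were suppressed in the statement of Theorem 4.1:

1. Let $C_{\mathrm{samp}} = \frac{\alpha\nu}{32(3-\nu)D}\min\left\{\frac{1}{4C_1}, \frac{\alpha\nu}{256 D(3-\nu)\psi C_1^2}\right\}$, and let the number of samples $n$ be such that

\[
n \ge \frac{p}{\xi(T)^4}\max\left\{\frac{128\psi^2}{C_{\mathrm{samp}}^2}, 2\right\}.
\]

Note that $n \gtrsim \frac{p}{\xi(T)^4}$.

2. Set $\delta_n = \sqrt{\frac{128 p\psi^2}{n}}$ and then set $\lambda_n$ as follows:

\[
\lambda_n = \frac{6 D\delta_n(2-\nu)}{\xi(T)\nu}.
\]

Note that $\lambda_n \asymp \frac{1}{\xi(T)}\sqrt{\frac{p}{n}}$.

3. Let the minimum nonzero singular value $\sigma$ of $L^*$ be such that

\[
\sigma \ge \frac{C_5\lambda_n}{\xi(T)^2},
\]

where $C_5$ is defined in Corollary D.4. Note that $\sigma \gtrsim \frac{1}{\xi(T)^3}\sqrt{\frac{p}{n}}$.

4. Let the minimum magnitude nonzero entry $\theta$ of $S^*$ be such that

\[
\theta \ge \frac{C_6\lambda_n}{\mu(\Omega)},
\]

where $C_6$ is defined in Corollary D.4. Note that $\theta \gtrsim \frac{1}{\xi(T)\mu(\Omega)}\sqrt{\frac{p}{n}}$.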
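The choice of $\delta_n$ and $\lambda_n$ in the second assumption is a closed-form recipe once the problem constants are known. The sketch below computes both; the function name is hypothetical, and the numerical values of $\psi$, $D$, $\nu$ and $\xi(T)$ must be supplied by the user since they depend on the unknown model.

```python
import math

def regularization_schedule(n, p, psi, D, nu, xi):
    """Return (delta_n, lambda_n) with
    delta_n = sqrt(128 p psi^2 / n) and
    lambda_n = 6 D delta_n (2 - nu) / (xi nu)."""
    if n < 2 * p:
        raise ValueError("the analysis assumes n >= 2p")
    if not (0 < nu <= 0.5 and 0 < xi <= 1):
        raise ValueError("expected nu in (0, 1/2] and xi in (0, 1]")
    delta_n = math.sqrt(128 * p * psi**2 / n)
    lambda_n = 6 * D * delta_n * (2 - nu) / (xi * nu)
    return delta_n, lambda_n
```

For example, $n = 128$, $p = 1$, $\psi = D = 1$, $\nu = \xi(T) = \frac{1}{2}$ gives $\delta_n = 1$ and $\lambda_n = 6 \cdot 1.5 / 0.25 = 36$; quadrupling $n$ halves both quantities, in line with the $\sqrt{p/n}$ scaling noted above.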

Proof of Theorem 4.1: We condition on the event that $\|E_n\|_2 \le \delta_n$, which holds with probability greater than $1 - 2\exp\{-p\}$ from Corollary D.10 as $n \ge 2p$ by assumption. We note that based on the bound on $n$, we also have that

\[
\delta_n \le \xi(T)^2\left[\frac{\alpha\nu}{32(3-\nu)D}\min\left\{\frac{1}{4C_1}, \frac{\alpha\nu}{256 D(3-\nu)\psi C_1^2}\right\}\right].
\]

In particular, these bounds imply that

\[
\delta_n \le \frac{\alpha\xi(T)\nu}{32(3-\nu)D}\min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}
\tag{D.16}
\]

and that

\[
\delta_n \le \frac{\alpha^2\xi(T)^2\nu^2}{8192\,\psi C_1^2(3-\nu)^2 D^2}.
\tag{D.17}
\]

Both these weaker bounds are used later.

Based on the assumptions above, the requirements of Lemma D.7 on the minimum nonzero singular value of $L^*$ and the minimum magnitude nonzero entry of $S^*$ are satisfied. We only need to verify the bounds on $\lambda_n$ and $g_\gamma(\mathcal{A}^\dagger E_n)$ from Proposition D.5, and the bound on $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S,\Delta_L)))$ from Lemma D.7.

First we verify the bound on $\lambda_n$. Based on the setting of $\lambda_n$ above and the bound on $\delta_n$ from (D.16), we have that

\[
\lambda_n = \frac{6 D(2-\nu)\delta_n}{\xi(T)\nu} \le \frac{3\alpha(2-\nu)}{16(3-\nu)}\min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}.
\]

Next we combine the facts that $\lambda_n = \frac{6 D\delta_n(2-\nu)}{\xi(T)\nu}$, and that $\|E_n\|_2 \le \delta_n$ to conclude that

\[
g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{D\delta_n}{\xi(T)} = \frac{\lambda_n\nu}{6(2-\nu)}.
\]



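Finally we provide a bound on the remainder by applying Propositions D.2 and D.1, which would satisfy the last remaining condition of Lemma D.7. In order to apply Proposition D.2, we note that

\[
\begin{aligned}
\frac{8}{\alpha}\Big[g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}}) + \lambda_n\Big] &\le \frac{8\lambda_n}{\alpha}\left[\frac{\nu}{3(2-\nu)} + 1\right] \\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \\
&= \frac{32(3-\nu)D}{\alpha\xi(T)\nu}\,\delta_n \\
&\le \min\left\{\frac{1}{4C_1}, \frac{\alpha\xi(T)}{64 D\psi C_1^2}\right\}.
\end{aligned}
\tag{D.18}
\]

In the first inequality we used the fact that $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n\nu}{6(2-\nu)}$ (from above) and that $g_\gamma(\mathcal{A}^\dagger\mathcal{I}^*C_{T_{\mathcal{M}}})$ is similarly bounded (from Corollary D.4 due to the bound on $\sigma$). In the second equality we used the relation $\lambda_n = \frac{6 D\delta_n(2-\nu)}{\xi(T)\nu}$. In the final inequality we used the bound on $\delta_n$ from (D.16). This satisfies one of the requirements of Proposition D.2. The other condition on $\|C_{T_{\mathcal{M}}}\|_2$ is also similarly satisfied due to the bound on $\sigma$ from Corollary D.4. Specifically, we have that $\|C_{T_{\mathcal{M}}}\|_2 \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}$ from Corollary D.4, and use the same sequence of inequalities as above to satisfy the second requirement of Proposition D.2. Thus we conclude from Proposition D.2 and from (D.18) that

\[
g_\gamma(\Delta_S,\Delta_L) \le \frac{64(3-\nu)D}{\alpha\xi(T)\nu}\,\delta_n.
\tag{D.19}
\]

This bound implies that $g_\gamma(\Delta_S,\Delta_L) \lesssim \frac{1}{\xi(T)}\sqrt{\frac{p}{n}}$, which proves the parametric consistency part of the theorem.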

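Since the bound (D.19) also satisfies the condition of Proposition D.1 (from the inequality following (D.18) above we see that $g_\gamma(\Delta_S,\Delta_L) \le \frac{1}{2C_1}$), we have that

\[
\begin{aligned}
g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) &\le \frac{2 D\psi C_1^2}{\xi(T)}\, g_\gamma(\Delta_S,\Delta_L)^2 \\
&\le \frac{2 D\psi C_1^2}{\xi(T)}\left(\frac{64(3-\nu)D}{\alpha\xi(T)\nu}\right)^2\delta_n^2 \\
&= \left[\frac{8192\,\psi C_1^2(3-\nu)^2 D^2}{\alpha^2\xi(T)^2\nu^2}\,\delta_n\right]\frac{D\delta_n}{\xi(T)} \\
&\le \frac{D\delta_n}{\xi(T)} \\
&= \frac{\lambda_n\nu}{6(2-\nu)}.
\end{aligned}
\]

In the final inequality we used the bound (D.17) on $\delta_n$, and in the final equality we used the relation $\lambda_n = \frac{6 D\delta_n(2-\nu)}{\xi(T)\nu}$. This concludes the algebraic consistency part of the theorem. □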